Transformation

After you have set the customization options, you can finish processing the transformer and begin using it to transform between raw and numerical data.

Processing

fit()

In the fit stage, the HyperTransformer references the config you set in the previous step while learning from your data values.

Parameters

  • (required) data: A pandas DataFrame containing the data to learn from. To avoid errors, make sure that the data matches the config.

This method performs more intensive computations and may take some time to run.

Output (None)

Examples

ht.fit(customers)

If you ever change your config, you must re-run fit to see any changes to the transformations.

Transforming into numerical data & back

transform()

Use the transform method to transform all the columns in your dataset at once.

Parameters

  • (required) data: A pandas DataFrame containing the data to transform. To avoid errors, make sure that the data matches the config.

Output A new pandas DataFrame with the transformed data. This DataFrame has fully numerical data that can be used for your data science projects.

Examples

transformed_customers = ht.transform(customers)

fit_transform()

In many cases you will want to fit and transform the same data. As a shortcut, you can use this method to do both at once.

Parameters

  • (required) data: A pandas DataFrame containing the data to fit and transform. To avoid errors, make sure that the data matches the config.

Output A new pandas DataFrame with the transformed data. This DataFrame has fully numerical data that can be used for your data science projects.

Examples

transformed_customers = ht.fit_transform(customers)

reverse_transform()

Use this method to recover data in the same format as the original. This method works just like transform but in reverse.

Parameters

  • (required) data: A pandas DataFrame containing transformed data.

Output A pandas DataFrame with the reverse transformed data. This data has the same column names and format as the original data.

The output data will be in the same format as the original data but the exact values may not be the same as the original. This depends on the exact transformations you used. Some transformers can recover the original data exactly. Others intentionally do not, for example for privacy reasons.

Examples

reversed_customers = ht.reverse_transform(transformed_customers)
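
To make the fit, transform and reverse_transform cycle concrete, here is a minimal pure-Python sketch of the idea for a single categorical column. The ToyLabelEncoder class and the card_types values are hypothetical and are not part of the library; the sketch only illustrates that fit learns a mapping from the data, transform applies it, and reverse_transform inverts it.

```python
class ToyLabelEncoder:
    """Hypothetical stand-in for one transformer; not the library's implementation."""

    def fit(self, values):
        # Learn one integer code per distinct value, in order of first appearance.
        self.codes = {}
        for value in values:
            self.codes.setdefault(value, len(self.codes))
        # Keep the inverse mapping so the transformation can be undone.
        self.labels = {code: value for value, code in self.codes.items()}

    def transform(self, values):
        # Values that were not seen during fit raise a KeyError here.
        return [self.codes[value] for value in values]

    def reverse_transform(self, codes):
        return [self.labels[code] for code in codes]


card_types = ['visa', 'amex', 'visa', 'discover']
encoder = ToyLabelEncoder()
encoder.fit(card_types)
transformed = encoder.transform(card_types)
reversed_values = encoder.reverse_transform(transformed)
print(transformed)      # [0, 1, 0, 2]
print(reversed_values)  # ['visa', 'amex', 'visa', 'discover']
```

This toy encoder is lossless, so the round trip recovers the original values exactly. A transformer that adds noise or replaces values would return data in the same format but with different values.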

Transforming a subset of the data

In some cases, you may only have access to a subset of columns from the original dataset. In this case, you can use the HyperTransformer to transform only a few columns, and not the full dataset.

transform_subset()

Use this method to transform a dataset that contains only a subset of the columns that were in the original data.

Parameters

  • (required) data: A pandas DataFrame object that contains your data. The data contains a subset of the columns that were in the original dataset.

Output A new pandas DataFrame with the transformed data. This DataFrame has fully numerical data that can be used for your data science projects.

Examples

# a subset means you only have some of the original columns
customer_subset = customers[['age', 'credit_card']]
transformed_subset = ht.transform_subset(customer_subset)

reverse_transform_subset()

Use this method to reverse transform a dataset that contains only a subset of the columns that were in the original data.

Parameters

  • (required) data: A pandas DataFrame object containing transformed data. The data contains a subset of the overall columns.

Output A pandas DataFrame with the reverse transformed data. This data has the same column names and format as the original data.

Examples

reversed_subset_customers = ht.reverse_transform_subset(transformed_subset)
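
The reason a subset can be transformed without re-fitting can be sketched in plain Python: each column has its own fitted transformer, so only the transformers for the columns that are present need to run. Everything below (ToyScaler, the fitted dictionary, the two helper functions and the sample values) is a hypothetical illustration, not the library's implementation.

```python
class ToyScaler:
    """Hypothetical per-column transformer: scales numbers to the 0-1 range."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)

    def transform(self, values):
        span = (self.high - self.low) or 1  # avoid dividing by zero
        return [(v - self.low) / span for v in values]

    def reverse_transform(self, values):
        span = (self.high - self.low) or 1
        return [v * span + self.low for v in values]


# Fit one transformer per column on the full dataset.
full_data = {'age': [20, 30, 40], 'balance': [0.0, 50.0, 100.0]}
fitted = {}
for column, values in full_data.items():
    fitted[column] = ToyScaler()
    fitted[column].fit(values)


def transform_subset(data):
    # Apply only the fitted transformers whose columns are present.
    return {col: fitted[col].transform(vals) for col, vals in data.items()}


def reverse_transform_subset(data):
    return {col: fitted[col].reverse_transform(vals) for col, vals in data.items()}


subset = {'age': [20, 40]}
transformed_subset = transform_subset(subset)
reversed_subset = reverse_transform_subset(transformed_subset)
print(transformed_subset)  # {'age': [0.0, 1.0]}
print(reversed_subset)     # {'age': [20.0, 40.0]}
```

The parameters learned during fit (here, the minimum and maximum of each column) come from the full dataset, so the subset is transformed consistently with the full data.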

Anonymization

Instead of transforming data, your use case might require fully anonymizing certain columns. You may also need to control the randomness during this process.

create_anonymized_columns()

Use this method to anonymize columns from scratch.

Parameters

  • (required) num_rows: An integer greater than 0 that describes the number of rows you want to create.

  • (required) column_names: A list of strings representing the column names that you want to create. Each column in this list must be assigned to either AnonymizedFaker or RegexGenerator in order to work. If you want to use other transformers, you'll need to reverse transform the data instead.

Output A pandas DataFrame that contains anonymized data for each column name for the desired number of rows.

Examples

anonymized_data = ht.create_anonymized_columns(
  num_rows=100,
  column_names=['student_id', 'address']
)
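
The restriction on column_names can be illustrated with a small pure-Python sketch: a column can only be created from scratch if it is assigned to something that produces values without any input data. The generator helpers, the generators dictionary and the sample addresses below are all hypothetical and only mimic the idea; they are not how AnonymizedFaker or RegexGenerator are implemented.

```python
import itertools
import random

rng = random.Random(42)


# Hypothetical generators: each one can produce values with no input data.
def make_id_generator(prefix):
    counter = itertools.count()
    return lambda: f'{prefix}_{next(counter):04d}'


def make_choice_generator(options):
    return lambda: rng.choice(options)


generators = {
    'student_id': make_id_generator('id'),
    'address': make_choice_generator(['1 Main St', '22 Oak Ave', '7 Elm Rd']),
}


def create_columns(num_rows, column_names):
    # Columns without a generator cannot be created from scratch.
    missing = [name for name in column_names if name not in generators]
    if missing:
        raise ValueError(f'No generator assigned to: {missing}')
    return {
        name: [generators[name]() for _ in range(num_rows)]
        for name in column_names
    }


anonymized = create_columns(num_rows=3, column_names=['student_id', 'address'])
print(anonymized['student_id'])  # ['id_0000', 'id_0001', 'id_0002']
```

Asking this sketch for a column with no generator, such as 'age', raises an error, which mirrors why columns handled by other transformers need real transformed data and a reverse transform.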

Controlling Randomization

Transformers may require some randomness during any of the methods above. In some cases, you may want to control this to guarantee that you get the same data for different runs.

reset_randomization()

Use this method to reset the random seed that the transformers use. After using this method, any fitting, transformation or anonymization request you make will produce the same results as it did the first time you made it.

Parameters None

Output None

Examples

In this example, calling reset_randomization means that reversed_data1 and reversed_data3 are equivalent, because each is the first call made after a reset.

ht.reset_randomization()
reversed_data1 = ht.reverse_transform(data)
reversed_data2 = ht.reverse_transform(data)

ht.reset_randomization()
reversed_data3 = ht.reverse_transform(data)

The same will be true for any other call such as transform, fit or create_anonymized_columns.
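
The reset behavior can be reproduced with Python's standard random module. This is a minimal sketch under the assumption that a reset simply restarts a seeded generator; the ToyNoisyReverser class and its seed argument are hypothetical and do not describe the library's internals.

```python
import random


class ToyNoisyReverser:
    """Hypothetical transformer whose reverse step uses randomness."""

    def __init__(self, seed=0):
        self._seed = seed
        self._rng = random.Random(seed)

    def reset_randomization(self):
        # Re-create the generator so the random sequence starts over.
        self._rng = random.Random(self._seed)

    def reverse_transform(self, values):
        # Add a small random jitter to every value.
        return [v + self._rng.random() for v in values]


toy = ToyNoisyReverser()
data = [1.0, 2.0, 3.0]

toy.reset_randomization()
reversed_data1 = toy.reverse_transform(data)
reversed_data2 = toy.reverse_transform(data)

toy.reset_randomization()
reversed_data3 = toy.reverse_transform(data)

print(reversed_data1 == reversed_data3)  # True
print(reversed_data1 == reversed_data2)  # False
```

The second call continues the random sequence where the first call stopped, which is why it differs from the first, while the call made right after a reset repeats the first result.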
