Transformation
After you have set the customization options, you can finish processing the transformer and begin using it to transform between raw and numerical data.
Processing
fit()
In the fit stage, the HyperTransformer references the config you set in the previous step while learning from your data values.
Parameters
(required) data: A pandas DataFrame object that contains your data.
This method performs more intensive computations and may take some time to run. To avoid any errors, make sure that the data matches the config.
Output (None)
Examples
If you ever change your config, you must re-run fit to see any changes to the transformations.
Transforming into numerical data & back
transform()
Use the transform method to transform all the columns in your dataset at once.
Parameters
(required) data: A pandas DataFrame object that contains your data.
To avoid any errors, make sure that the data matches the config.
Output A new pandas DataFrame with the transformed data. This DataFrame has fully numerical data that can be used for your data science projects.
fit_transform()
In many cases you will want to fit and transform the same data. As a shortcut, you can use this method to do both at once.
Parameters
(required) data: A pandas DataFrame object that contains your data.
To avoid any errors, make sure that the data matches the config.
Output A new pandas DataFrame with the transformed data. This DataFrame has fully numerical data that can be used for your data science projects.
Examples
reverse_transform()
Use this method to recover data in the same format as the original. This method works just like transform but in reverse.
Parameters
(required) data: A pandas DataFrame object containing transformed data.
Output A pandas DataFrame with the reverse transformed data. This data has the same column names and format as the original data.
The output data will be in the same format as the original data but the exact values may not be the same as the original. This depends on the exact transformations you used. Some transformers can recover the original data exactly. Others intentionally do not, for example for privacy reasons.
Examples
Transforming a subset of the data
In some cases, you may only have access to a subset of columns from the original dataset. In this case, you can use the HyperTransformer to transform only a few columns, and not the full dataset.
transform_subset()
Use this method to transform a dataset that contains only a subset of the columns that were in the original data.
Parameters
(required) data: A pandas DataFrame object that contains your data. The data contains a subset of the columns that were in the original dataset.
Output A new pandas DataFrame with the transformed data. This DataFrame has fully numerical data that can be used for your data science projects.
reverse_transform_subset()
Use this method to reverse transform a dataset that contains only a subset of the columns that were in the original data.
Parameters
(required) data: A pandas DataFrame object containing transformed data. The data contains a subset of the overall columns.
Output A pandas DataFrame with the reverse transformed data. This data has the same column names and format as the original data.
Examples
Anonymization
Instead of transforming data, your use case might require fully anonymizing certain columns. You may also need to control the randomness during this process.
create_anonymized_columns()
Use this method to anonymize columns from scratch.
Parameters
(required) num_rows: An integer >0 that describes the number of rows you want to create.
(required) column_names: A list of strings representing the column names that you want to create. Each column in this list must be assigned to either AnonymizedFaker or RegexGenerator in order to work. If you want to use other transformers, you'll need to reverse transform the data instead.
Output A pandas DataFrame that contains anonymized data for each column name for the desired number of rows
Examples
Controlling Randomization
Transformers may require some randomness during any of the methods above. In some cases, you may want to control this to guarantee that you get the same data for different runs.
reset_randomization()
Use this method to reset the random seed that the transformers use. After using this method, any fitting, transformation or anonymization request you make will produce the same results as it did the first time it ran from that seed.
Parameters None
Output None
Examples
In this example, calling reset_randomization means that reversed_data1 and reversed_data3 are equivalent. The same will be true for any other call, such as transform, fit or create_anonymized_columns.