All synthesizers pre-process your data to prepare it for machine learning and then post-process the synthetic data to convert it back into the original format. These advanced features allow you to control the transformations that are applied to each column.
Do you have any sensitive data? The transformations are also used for anonymizing or pseudo-anonymizing sensitive data.
The methods below allow you to see which transformations will be applied to each column based on the synthesizer and data you plan to use.
Let the synthesizer auto-assign the transformations based on the data you'd like to use for modeling.
After assigning the transformers, you can get more details about the transformers that will be used on each column.
table_name: A string with the table name you'd like to see the transformer for
'guest_email': AnonymizedFaker(provider_name='internet', function_name='email', enforce_uniqueness=True),
'amenities_fee': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True),
'checkin_date': UnixTimestampEncoder(datetime_format='%d %b %Y'),
'checkout_date': UnixTimestampEncoder(datetime_format='%d %b %Y'),
'room_rate': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True),
'billing_address': AnonymizedFaker(provider_name='address', function_name='address'),
'credit_card_number': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number', enforce_uniqueness=True)
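To make the datetime entries in the listing above more concrete, here is a minimal standard-library sketch of the idea behind a timestamp encoder: a date string in the '%d %b %Y' format becomes a number for modeling and is converted back afterward. This is an illustration only, not RDT's implementation. The helper names encode_date and decode_date are hypothetical, and the sketch assumes English month abbreviations and UTC.

```python
from datetime import datetime, timezone

DATETIME_FORMAT = '%d %b %Y'  # same format string as in the listing above


def encode_date(text, fmt=DATETIME_FORMAT):
    """Pre-process: turn a formatted date string into seconds since the Unix epoch (UTC)."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def decode_date(value, fmt=DATETIME_FORMAT):
    """Post-process: turn a numeric timestamp back into the original string format."""
    return datetime.fromtimestamp(value, tz=timezone.utc).strftime(fmt)


encoded = encode_date('27 Dec 2020')  # numeric value a model can learn from
decoded = decode_date(encoded)        # back to the original string format
```

The round trip is what matters here: whatever numeric representation is used during modeling, the post-processing step has to restore the original format.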
After assigning transformers, you can also modify them to customize the pre- and post-processing.
table_name: A string representing the table name you'd like to update the transformer for
column_name_to_transformer: A dictionary mapping the column name to the new RDT transformer object that you'd like to use for pre- and post-processing.
from rdt.transformers import FloatFormatter
Be careful with this step! Make sure you are updating transformers that are compatible with each column's sdtype and make sense for the underlying machine learning model.
from rdt.transformers import PseudoAnonymizedFaker
'credit_card_number': PseudoAnonymizedFaker(provider_name='credit_card', function_name='credit_card_number'),
Anonymization vs. Pseudo-anonymization
Pseudo-anonymization preserves a mapping between the real, sensitive values and the fake, synthetic data. Use this if you'd like to trace back the synthetic data to real values.
Anonymization is not reversible. Anyone with access to the fake, synthetic data will not be able to map it back to any value in the real data.
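The difference between the two approaches can be sketched with plain Python. This is a conceptual illustration under simplifying assumptions, not how the RDT transformers are implemented: the function names are hypothetical, and sequential placeholder strings stand in for the realistic values a Faker provider would generate.

```python
import itertools


def make_fake_values(prefix):
    """Return an endless stream of distinct placeholder values (stand-ins for Faker output)."""
    return (f'{prefix}-{i:04d}' for i in itertools.count(1))


def pseudo_anonymize(values, prefix='card'):
    """Replace each real value with a fake one and KEEP the real-to-fake mapping."""
    fakes = make_fake_values(prefix)
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = next(fakes)
    return [mapping[value] for value in values], mapping


def anonymize(values, prefix='card'):
    """Replace each real value with a fake one and keep no mapping at all."""
    fakes = make_fake_values(prefix)
    return [next(fakes) for _ in values]


real = ['4111-1111', '5500-0000', '4111-1111']

# Pseudo-anonymization: the stored mapping lets you trace fake values back.
pseudo, mapping = pseudo_anonymize(real)
reverse = {fake: original for original, fake in mapping.items()}
traced_back = [reverse[fake] for fake in pseudo]

# Anonymization: no mapping exists, so there is nothing to trace back with.
anonymous = anonymize(real)
```

In the pseudo-anonymized output, a repeated real value always receives the same fake value, and whoever holds the mapping can reverse it. In the anonymized output, every value is freshly generated and no record links it to the real data.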
After modifying the transformations, you can apply them to the real data in a step-wise fashion or all at once.
Use this function to preprocess the data according to the transformations. After preprocessing, you should have numerical data that is ready for modeling.
processed_data = synthesizer.preprocess(real_data)
Use this function to perform model training on preprocessed data. Note that this step may take a while, as it requires machine learning to learn trends from your dataset.
Use this function to perform the preprocessing and model training all in one step. Internally, this uses both the preprocess and fit_processed_data functions in succession.
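The relationship between the step-wise and one-step approaches can be sketched with a toy class. ToySynthesizer is a hypothetical stand-in written only to show the call order. It is not an SDV class, and its "model" is nothing more than a per-column mean.

```python
class ToySynthesizer:
    """Hypothetical stand-in used only to illustrate the call order; not an SDV class."""

    def __init__(self):
        self.calls = []
        self.column_means = None

    def preprocess(self, data):
        # Pre-processing step: convert every value to a float.
        self.calls.append('preprocess')
        return {col: [float(v) for v in values] for col, values in data.items()}

    def fit_processed_data(self, processed):
        # Modeling step: "learn" something trivial, here the mean of each column.
        self.calls.append('fit_processed_data')
        self.column_means = {col: sum(v) / len(v) for col, v in processed.items()}

    def fit(self, data):
        # One-step convenience: run both steps in succession.
        self.fit_processed_data(self.preprocess(data))


real_data = {'room_rate': ['100', '150'], 'amenities_fee': ['10', '30']}

# Step-wise: inspect or adjust the processed data between the two calls.
stepwise = ToySynthesizer()
processed_data = stepwise.preprocess(real_data)
stepwise.fit_processed_data(processed_data)

# All at once: same result, without the intermediate data in hand.
one_step = ToySynthesizer()
one_step.fit(real_data)
```

The step-wise route is mainly useful when you want to look at the processed data before modeling. Otherwise the one-step call ends in the same state.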
In some cases, your data may already be processed such that the synthesizer does not need to perform additional transformations. You can remove transformers by assigning them to None.
Based on the data you're using, the synthesizer may determine that no transformations are necessary for machine learning. In this case, you will see that the column is assigned to None, meaning that the data will not undergo any transformation for the machine learning model.
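As a rough illustration of what a None entry means, the sketch below applies a dictionary of per-column transformers to one row and passes through any column mapped to None. The function apply_transformers and the column has_rewards are hypothetical names chosen for this example. The real synthesizer handles this internally.

```python
def apply_transformers(row, transformers):
    """Apply each column's transformer; a column mapped to None (or absent) passes through."""
    processed = {}
    for column, value in row.items():
        transformer = transformers.get(column)
        processed[column] = value if transformer is None else transformer(value)
    return processed


transformers = {
    'has_rewards': lambda flag: 1.0 if flag else 0.0,  # boolean becomes a number
    'room_rate': None,                                 # already numeric: no transformation
}

row = {'has_rewards': True, 'room_rate': 129.5}
processed_row = apply_transformers(row, transformers)
```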