Preprocessing
All synthesizers pre-process your data to prepare it for machine learning and then post-process synthetic data to convert it into the original format. Advanced features allow you to control the transformations that are applied to each column.
Do you have any sensitive data? The transformations are also used for anonymizing or pseudo-anonymizing sensitive data.
The SDV pre- and post-processes your data using reversible data transformations for each column. The transformers are available in our RDT library.
The methods below allow you to see which transformations will be applied to each column based on the synthesizer and data you plan to use.
Let the synthesizer auto-assign the transformations based on the data you'd like to use for modeling.
Parameters
(required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from
Output (None)
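As a minimal sketch of this step, the snippet below creates a synthesizer and lets it choose the transformers. It assumes the SDV 1.x single-table API (SingleTableMetadata, GaussianCopulaSynthesizer and a method named auto_assign_transformers); newer releases may prefer a different metadata class. The toy column names are hypothetical.

```python
def make_toy_rows():
    """Hypothetical toy records; the column names are illustrative only."""
    return {
        "age": [34, 51, 27, 45],
        "plan": ["basic", "pro", "basic", "pro"],
    }


def assign_transformers(rows):
    """Create a synthesizer and auto-assign transformers from the data.

    Imports are local so this module still loads where pandas and sdv
    are not installed. Assumes the SDV 1.x single-table API.
    """
    import pandas as pd
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    data = pd.DataFrame(rows)

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data)  # infer an sdtype for each column

    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.auto_assign_transformers(data)  # no return value
    return synthesizer, data
```

Calling assign_transformers(make_toy_rows()) should give back a synthesizer whose transformers are assigned but not yet fitted.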
After assigning the transformers, you can get more details about the transformers that will be used on each column.
Parameters (None)
Output A dictionary mapping each column name to an RDT transformer object that will pre- and post-process your data.
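To make the returned dictionary easier to scan, a small helper can reduce each transformer to its class name. This is a sketch that assumes the method is named get_transformers and that a column left untouched may map to None; describe_transformers is a hypothetical helper, not part of the SDV.

```python
def describe_transformers(synthesizer):
    """Map each column name to the class name of its assigned transformer.

    Assumes synthesizer.get_transformers() returns a dict of
    column name -> transformer object, possibly None for some columns.
    """
    assigned = synthesizer.get_transformers()
    return {
        column: None if transformer is None else type(transformer).__name__
        for column, transformer in assigned.items()
    }
```

Printing the result for a synthesizer with assigned transformers gives a compact overview of which transformer class handles each column.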
After assigning transformers, you can also modify them to customize the pre- and post-processing.
Parameters
(required) column_name_to_transformer: A dictionary mapping each column name to the new RDT transformer object that you'd like to use for pre- and post-processing.
Output (None)
Be careful with this step! Make sure the transformers you assign are compatible with each column's sdtype and make sense for the underlying machine learning model. For example, certain synthesizers like GaussianCopulaSynthesizer require that the data is fully numerical, without any missing values.
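One way to apply the update is sketched below, assuming the method is named update_transformers and takes the column_name_to_transformer dictionary described above. The helper update_columns and its make_transformer argument are hypothetical. It builds a separate transformer object per column, on the assumption that a transformer stores what it learns from one column when fitted.

```python
def update_columns(synthesizer, column_names, make_transformer):
    """Assign a freshly built transformer to each listed column.

    make_transformer is a zero-argument callable, so each column receives
    its own transformer instance instead of sharing one object.
    """
    mapping = {name: make_transformer() for name in column_names}
    synthesizer.update_transformers(column_name_to_transformer=mapping)
    return mapping
```

With the RDT library installed, a call could look like update_columns(synthesizer, ["plan"], lambda: LabelEncoder(add_noise=True)), where "plan" is a hypothetical categorical column. LabelEncoder produces numerical output, which fits the GaussianCopulaSynthesizer requirement mentioned above.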
Parameters
(required) sdtype: A string with the name of the sdtype that you want to change the transformers for
(required) transformer_name: A string with the name of the transformer to use.
transformer_parameters: A dictionary that maps the name of the transformer parameter (string) to the parameter value. Use this if you want to override the default settings.
(default) None: Do not override the default settings
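The three parameters above can be combined as in the sketch below. The method name update_transformers_by_sdtype is an assumption (it is not spelled out in this text), as are the example rules; LabelEncoder and FloatFormatter are RDT transformer names, and add_noise is a LabelEncoder setting.

```python
# Hypothetical rules: one entry per sdtype to reassign.
SDTYPE_RULES = [
    {
        "sdtype": "categorical",
        "transformer_name": "LabelEncoder",
        "transformer_parameters": {"add_noise": True},
    },
    # No transformer_parameters key: the transformer keeps its defaults.
    {"sdtype": "numerical", "transformer_name": "FloatFormatter"},
]


def apply_sdtype_rules(synthesizer, rules):
    """Reassign the transformer for every column of each listed sdtype."""
    for rule in rules:
        synthesizer.update_transformers_by_sdtype(
            sdtype=rule["sdtype"],
            transformer_name=rule["transformer_name"],
            # None means the default settings are not overridden
            transformer_parameters=rule.get("transformer_parameters"),
        )
```

Keeping the rules in a plain list makes it easy to review which sdtypes were changed before training.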
After modifying the transformations, you can apply them to the real data in a step-wise fashion or all at once.
Use this function to preprocess the data according to the transformations. After preprocessing, you should have numerical data that is ready for modeling.
Parameters
Use this function to perform model training on preprocessed data. Note that this step may take a while, as it requires machine learning to learn trends from your dataset.
Parameters
Use this function to perform the preprocessing and model training all in one step. Internally, this uses both the preprocess and fit_processed_data functions in succession.
Parameters
Use this method to reassign new transformers that you'd like to use for specific columns. You can apply any transformer in the RDT library to your data.
Use this method to reassign new transformers for all columns of a specific sdtype. For example, reassign the processing for all numerical or categorical columns. You can apply any transformer in the RDT library to your data.
Use the AnonymizedFaker or PseudoAnonymizedFaker for sensitive data.
(required) data: A pandas DataFrame object with the real data that you want to preprocess
Output A pandas DataFrame object with the preprocessed version of the real data. Based on the transformations, some columns may be converted, added or removed.
(required) processed_data: A pandas DataFrame object with data that is ready for machine learning. (This is the output of the preprocess method.)
Output (None). This step trains the model using machine learning. After this step, you are ready to create synthetic data. For more details, see the sampling documentation.
(required) data: A pandas DataFrame object with the real data that you want the synthesizer to learn from
Output (None). After this step, you are ready to create synthetic data. For more details, see the sampling documentation.
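The two ways of training described on this page can be compared side by side. The sketch below uses the preprocess, fit_processed_data and fit methods named above; the wrapper functions fit_stepwise and fit_all_at_once are hypothetical helpers for illustration.

```python
def fit_stepwise(synthesizer, data):
    """Preprocess first, then train on the processed data.

    Returning the processed data lets you inspect it (for example, to
    confirm that it is numerical) before or after training.
    """
    processed = synthesizer.preprocess(data)
    synthesizer.fit_processed_data(processed)  # trains the model
    return processed


def fit_all_at_once(synthesizer, data):
    """Preprocess and train in a single call."""
    synthesizer.fit(data)
```

The step-wise route is useful when you have customized the transformers and want to check the preprocessed columns; otherwise the single fit call is the shorter path to the same trained synthesizer.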