TVAESynthesizer
Copyright (c) 2023, DataCebo, Inc.
The TVAE Synthesizer uses variational autoencoder (VAE)-based neural network techniques to train a model and generate synthetic data.
When creating your synthesizer, you are required to pass in a Metadata object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.
enforce_min_max_values
: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data
enforce_rounding
: Control whether the synthetic data should have the same number of decimal digits as the real data
locales
: A list of locale strings. Any PII columns will correspond to the locales that you provide.
epochs
: Number of times to train the VAE. Each new epoch can improve the model.
verbose
: Control whether to print out the results of each epoch. You can use this to track the training time as well as the improvements per epoch.
cuda
: Whether to use CUDA, a parallel computing platform that allows you to speed up modeling time using the GPU
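As a rough sketch of how these options fit together, the snippet below collects the parameters from the list above into keyword arguments and passes them to the constructor. The import path sdv.single_table comes from the SDV library rather than from this page, the helper name create_synthesizer is hypothetical, and the values are illustrative rather than recommended.

```python
# Optional settings named in the list above; the values are only examples.
synthesizer_kwargs = {
    "enforce_min_max_values": True,
    "enforce_rounding": False,
    "locales": ["en_US", "fr_CA"],
    "epochs": 500,
    "verbose": True,
    "cuda": True,
}


def create_synthesizer(metadata):
    """Build a TVAESynthesizer (hypothetical helper).

    `metadata` is assumed to be an SDV Metadata object describing your table.
    """
    # Imported lazily so this sketch can be loaded without SDV installed.
    from sdv.single_table import TVAESynthesizer

    # Metadata is the required first argument; everything else is optional.
    return TVAESynthesizer(metadata, **synthesizer_kwargs)
```

Calling create_synthesizer(metadata) with your own Metadata object should return an untrained synthesizer that is ready for fitting.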
Looking for more customizations? Other settings are available to fine-tune the architecture of the neural network used to model the data.
Use the get_parameters function to access all the parameters your synthesizer uses, both those you have provided and the defaults.
Parameters (None)
Output A dictionary with the parameter names and the values
The returned parameters are a copy. Changing them will not affect the synthesizer.
Use the get_metadata function to access the metadata object that you have included for the synthesizer.
Parameters (None)
Output A Metadata object
The returned metadata is a copy. Changing it will not affect the synthesizer.
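The copy behavior of both accessors can be sketched as follows. The method names get_parameters and get_metadata are those of the SDV library; the helper inspect_synthesizer is hypothetical and only assumes an object that exposes those two methods.

```python
def inspect_synthesizer(synthesizer):
    """Return the epochs value still in use, plus a copy of the metadata.

    `synthesizer` is assumed to expose get_parameters() and get_metadata().
    """
    parameters = synthesizer.get_parameters()
    metadata = synthesizer.get_metadata()

    # Editing the returned dictionary only changes this local copy.
    parameters["epochs"] = 1

    # A second call still reports the value the synthesizer actually uses.
    current_epochs = synthesizer.get_parameters()["epochs"]
    return current_epochs, metadata
```

To change a setting for real, create a new synthesizer with the desired parameter instead of editing the returned dictionary.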
To learn a machine learning model based on your real data, use the fit method.
Parameters
(required) data
: A pandas DataFrame object containing the real data that the machine learning model will learn from
Output (None)
Technical Details: This synthesizer uses the TVAE to learn a model from real data and create synthetic data. The TVAE uses variational autoencoders (VAEs) to model data, as described in the paper "Modeling Tabular data using Conditional GAN", which was presented at the NeurIPS conference in 2019.
After fitting, you can use the get_loss_values function to access the loss values computed during each epoch and batch.
Parameters (None)
Output A pandas.DataFrame object containing epoch number, batch number and loss value.
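One way to use the loss values is to average them per epoch and check whether training is still improving. The sketch below is a minimal example, assuming the returned DataFrame lists its columns in the order epoch, batch, loss as described above. The helper names mean_loss_per_epoch and fit_and_summarize are hypothetical, while fit and get_loss_values are the SDV method names.

```python
from collections import defaultdict


def mean_loss_per_epoch(rows):
    """Average the loss over all batches of each epoch.

    `rows` is any iterable of (epoch, batch, loss) triples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for epoch, _batch, loss in rows:
        totals[epoch] += loss
        counts[epoch] += 1
    return {epoch: totals[epoch] / counts[epoch] for epoch in sorted(totals)}


def fit_and_summarize(synthesizer, data):
    """Train on `data` (a pandas DataFrame), then summarize the loss per epoch."""
    synthesizer.fit(data)
    loss_values = synthesizer.get_loss_values()
    # Assumes the columns come in the order epoch, batch, loss.
    return mean_loss_per_epoch(loss_values.itertuples(index=False, name=None))
```

If the per-epoch averages have stopped decreasing well before the last epoch, a smaller epochs value may give similar quality in less time.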
Save your trained synthesizer for future use.
Use the save function to save your trained synthesizer as a Python pickle file.
Parameters
(required) filepath
: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl
Output (None). The file is saved at the specified location.
Use the load function to load a trained synthesizer from a Python pickle file.
Parameters
(required) filepath
: A string describing the filepath of your saved synthesizer
Output Your synthesizer, as a TVAESynthesizer object
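A save-and-reload round trip might look like the sketch below. The save and load calls and the sdv.single_table import path come from the SDV library; the helpers pkl_path and save_and_reload are hypothetical names introduced here, and the suffix check simply enforces the .pkl advice given above.

```python
from pathlib import Path


def pkl_path(filepath):
    """Return `filepath` as a string, checking that it ends in .pkl."""
    path = Path(filepath)
    if path.suffix != ".pkl":
        raise ValueError(f"Expected a path ending in .pkl, got {filepath!r}")
    return str(path)


def save_and_reload(synthesizer, filepath):
    """Save a trained synthesizer, then load it back from the same file."""
    # Imported lazily so the path check above works without SDV installed.
    from sdv.single_table import TVAESynthesizer

    target = pkl_path(filepath)
    synthesizer.save(filepath=target)
    return TVAESynthesizer.load(filepath=target)
```

Because the file is a Python pickle, only load synthesizer files that come from a source you trust.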
After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.
Want to improve your synthesizer? Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.
For more details, see Customizations.
enforce_min_max_values
(default) True
The synthetic data will contain numerical values that are within the ranges of the real data.
False
The synthetic data may contain numerical values that are less than or greater than the real data. Note that you can still set the limits on individual columns using Constraints.
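To make the effect of enforce_min_max_values concrete, here is a conceptual sketch of what "within the ranges of the real data" means for one numerical column. It is not SDV's internal code, and this page does not say which mechanism SDV uses; the helper clip_to_observed_range and the sample values are made up for illustration.

```python
def clip_to_observed_range(real_values, synthetic_values):
    """Clip synthetic values to the minimum and maximum seen in the real data."""
    low, high = min(real_values), max(real_values)
    return [min(max(value, low), high) for value in synthetic_values]


real_ages = [18, 25, 40, 67]
raw_synthetic_ages = [12, 30, 71]

# 12 is raised to the real minimum and 71 is lowered to the real maximum.
bounded_ages = clip_to_observed_range(real_ages, raw_synthetic_ages)
print(bounded_ages)  # [18, 30, 67]
```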
enforce_rounding
(default) True
The synthetic data will be rounded to the same number of decimal digits that were observed in the real data
False
The synthetic data may contain more decimal digits than were observed in the real data
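The idea behind enforce_rounding can be illustrated in the same spirit. The sketch below finds the largest number of decimal digits in a real column and rounds synthetic values to match. It is a simplified illustration under the assumption of finite numeric values, not SDV's actual implementation, and both helper names are hypothetical.

```python
from decimal import Decimal


def observed_decimal_digits(real_values):
    """Return the largest number of decimal digits among the real values."""
    digits = 0
    for value in real_values:
        # The exponent of the decimal representation is minus the digit count.
        exponent = Decimal(str(value)).as_tuple().exponent
        digits = max(digits, -exponent)
    return digits


def round_like_real_data(real_values, synthetic_values):
    """Round synthetic values to the precision observed in the real data."""
    digits = observed_decimal_digits(real_values)
    return [round(value, digits) for value in synthetic_values]


real_prices = [9.99, 12.5, 100.0]
rounded_prices = round_like_real_data(real_prices, [10.12345, 57.98765])
print(rounded_prices)  # [10.12, 57.99]
```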
locales
(default) ['en_US']
Generate PII values in English corresponding to US-based concepts (e.g., addresses, phone numbers, etc.)
<list>
Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.
For example: ["en_US", "fr_CA"].
For all options, see the Faker docs.
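A quick format check can catch typos in locale strings before they reach the synthesizer. The sketch below tests only the shape described above (two letters, an underscore, two letters); the lowercase-language and uppercase-country convention is an assumption based on the examples, and the helper name invalid_locales is hypothetical. Whether a well-formed locale is actually supported is still decided by Faker.

```python
import re

# Two letters for the language, an underscore, two letters for the country.
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def invalid_locales(locales):
    """Return the locale strings that do not match the expected format."""
    return [locale for locale in locales if not LOCALE_PATTERN.fullmatch(locale)]


print(invalid_locales(["en_US", "fr_CA"]))    # []
print(invalid_locales(["en-US", "english"]))  # ['en-US', 'english']
```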
epochs
(default) 300
Run all the data through the VAE 300 times during training
<number>
Train for a different number of epochs. Note that larger numbers will increase the modeling time.
verbose
(default) False
Do not print out any results
True
Print out the loss values per epoch. The loss values indicate how well the VAE is currently performing; lower values indicate higher quality.
cuda
(default) True
If available, use CUDA to speed up modeling time. If CUDA is not available, this setting has no effect.
False
Do not use CUDA to speed up modeling time.