The CTGAN Synthesizer uses GAN-based deep learning methods to train a model and generate synthetic data.
from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)
Creating a synthesizer
When creating your synthesizer, you are required to pass in a Metadata object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.
enforce_min_max_values: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data.
  (default) True: The synthetic data will contain numerical values that are within the ranges of the real data.
  False: The synthetic data may contain numerical values that are less than or greater than the real data. Note that you can still set the limits on individual columns using Constraints.
enforce_rounding: Control whether the synthetic data should have the same number of decimal digits as the real data.
  (default) True: The synthetic data will be rounded to the same number of decimal digits that were observed in the real data.
  False: The synthetic data may contain more decimal digits than were observed in the real data.
locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.
  (default) ['en_US']: Generate PII values in English corresponding to US-based concepts (eg. addresses, phone numbers, etc.)
  <list>: Create data from the list of locales. Each locale string consists of a 2-character code for the language and a 2-character code for the country, separated by an underscore.
epochs: Number of times to train the GAN. Each new epoch can improve the model.
  (default) 300: Run all the data through the Generator and Discriminator 300 times during training.
  <number>: Train for a different number of epochs. Note that larger numbers will increase the modeling time.
verbose: Control whether to print out the results of each epoch. You can use this to track the training time as well as the improvements per epoch.
  (default) False: Do not print out any results.
  True: Print out the Generator and Discriminator loss values per epoch. The loss values indicate how well the GAN is currently performing, with lower values indicating higher quality.
cuda: Whether to use CUDA, a parallel computing platform that allows you to speed up modeling time using the GPU.
  (default) True: If available, use CUDA to speed up modeling time. If it's not available, then there will be no difference.
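For example, here is a minimal sketch of creating a customized synthesizer, assuming you have already created a metadata object. The parameter values shown are illustrative, not recommendations.

from sdv.single_table import CTGANSynthesizer

# Illustrative values only -- all parameters besides metadata are optional
synthesizer = CTGANSynthesizer(
    metadata,                     # your Metadata object
    enforce_min_max_values=True,
    enforce_rounding=False,
    locales=['en_US'],
    epochs=500,
    verbose=True,
    cuda=True
)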
Looking for more customizations? Other settings are available to fine-tune the architecture of the underlying GAN used to model the data. These settings are specific to the GAN; use them if you want to optimize the technical architecture and modeling.
batch_size: Number of data samples to process in each step. This value must be even, and it must be divisible by the pac parameter (see below). Defaults to 500.
discriminator_dim: Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
discriminator_decay: Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.
discriminator_lr: Learning rate for the discriminator. Defaults to 2e-4.
discriminator_steps: Number of discriminator updates to do for each generator update. Defaults to 1, matching the original CTGAN implementation.
embedding_dim: Size of the random sample passed to the Generator. Defaults to 128.
generator_decay: Generator weight decay for the Adam Optimizer. Defaults to 1e-6.
generator_dim: Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
generator_lr: Learning rate for the generator. Defaults to 2e-4.
log_frequency: Whether to use log frequency of categorical levels in conditional sampling. Defaults to True.
pac: Number of samples to group together when applying the discriminator. Defaults to 10.
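As a sketch, these GAN-specific settings are passed in the same way when creating the synthesizer. The values below are illustrative only.

synthesizer = CTGANSynthesizer(
    metadata,
    batch_size=1000,               # must be even and divisible by pac
    pac=10,
    generator_dim=(512, 512),      # one Residual Layer per value
    discriminator_dim=(512, 512),  # one Linear Layer per value
    generator_lr=2e-4,
    discriminator_lr=2e-4,
    discriminator_steps=1
)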
get_parameters
Use this function to access all the parameters your synthesizer uses -- those you have provided as well as the default ones.
Parameters (None)
Output A dictionary with the parameter names and the values
The returned parameters are a copy. Changing them will not affect the synthesizer.
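For example (the dictionary contents shown in the comment are illustrative):

parameters = synthesizer.get_parameters()
# A dictionary of both customized and default values, for example:
# {'enforce_min_max_values': True, 'enforce_rounding': False, 'epochs': 500, ...}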
Learning from your data
To learn a machine learning model based on your real data, use the fit method.
fit
Parameters
(required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from
Output (None)
synthesizer.fit(data)
Technical Details: This synthesizer uses CTGAN to learn a model from real data and create synthetic data. CTGAN uses generative adversarial networks (GANs) to model data, as described in the Modeling Tabular data using Conditional GAN paper, which was presented at the NeurIPS conference in 2019.
get_loss_values
After fitting, you can access the loss values computed during each epoch for both the generator and discriminator.
Parameters (None)
Output A pandas.DataFrame object containing epoch number, generator loss value and discriminator loss value.
synthesizer.get_loss_values()
Epoch   Generator Loss   Discriminator Loss
1       1.7863           -0.3639
2       1.5484           0.2260
3       1.3633           -0.0441
...
get_loss_values_plot
After fitting, you can plot the loss values at each epoch, for both the generator and discriminator.
Parameters (None)
Output A plotly Figure object that plots the loss values per epoch
Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.
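For example:

fig = synthesizer.get_loss_values_plot()
fig.show()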
save
Use this function to save your trained synthesizer as a Python pickle file.
Parameters
(required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl
Output (None) The file will be saved at the desired location
synthesizer.save(
    filepath='my_synthesizer.pkl'
)
CTGANSynthesizer.load
Use this function to load a trained synthesizer from a Python pickle file.
Parameters
(required) filepath: A string describing the filepath of your saved synthesizer
Output Your synthesizer, as a CTGANSynthesizer object
from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer.load(
    filepath='my_synthesizer.pkl'
)
What's next?
After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.
Want to improve your synthesizer? Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.
What happens if columns don't contain numerical data?
This synthesizer models non-numerical columns, including columns with missing values.
Although the CTGAN algorithm is designed for complete, numerical data, this synthesizer converts other data types using Reversible Data Transforms (RDTs). To access and modify the transformations, see Advanced Features.
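As a sketch of what inspecting and overriding these transformations can look like (the 'amount' column name and the FloatFormatter choice are illustrative; see Advanced Features for the full API):

from rdt.transformers.numerical import FloatFormatter
from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)
print(synthesizer.get_transformers())  # maps each column to its assigned transformer

synthesizer.update_transformers(
    column_name_to_transformer={
        'amount': FloatFormatter()     # 'amount' is a hypothetical column name
    }
)
synthesizer.fit(data)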
How do I tune the hyperparameters? (Such as epochs or other values)
Unfortunately, there is no one-size-fits-all solution for this question! The optimal hyperparameters depend on both the complexity of your dataset and the metrics you are using to quantify success.
Epochs is a well-studied parameter. Our experiments suggest that increasing the number of epochs helps up until a certain inflection point. After this, there is no significant improvement. Keep in mind that increasing the epochs also increases the training time. More information is available in this discussion and blog post.
For other hyperparameters, you will have to do some experimentation yourself. You may have luck using external hyperparameter tuning libraries. Usually, these libraries test a combination of hyperparameters to determine the best set for your desired goal or metric.
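As a sketch, a simple manual search might compare overall quality scores across a few epoch settings using SDV's built-in evaluation (the search space below is illustrative):

from sdv.evaluation.single_table import evaluate_quality
from sdv.single_table import CTGANSynthesizer

best_score, best_epochs = 0.0, None
for epochs in [100, 300, 500]:  # illustrative search space
    synthesizer = CTGANSynthesizer(metadata, epochs=epochs)
    synthesizer.fit(data)
    synthetic_data = synthesizer.sample(num_rows=len(data))
    report = evaluate_quality(data, synthetic_data, metadata)
    if report.get_score() > best_score:
        best_score, best_epochs = report.get_score(), epochs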
The verbose setting is reporting negative loss. Is that ok?
Yes, both the generator and discriminator are trained to minimize the loss value, which often becomes negative. Our experiments suggest that the generator loss tends to become more negative as the epochs progress, while the discriminator loss tends to fluctuate around 0.
Other trends are possible for different datasets. But look out for cases where the generator and discriminator loss values are not converging -- this may indicate that you need to modify the GAN's architecture or that your data is not suitable for CTGAN.
Can I call fit again even if I've previously fit some data?
Yes, even if you've previously fit data, you should be able to call the fit method again.
If you do this, the synthesizer will start over from scratch and fit the new data that you provide it. This is the equivalent of creating a new synthesizer and fitting it with new data.
What is the difference between CTGANSynthesizer and the CTGAN Library?
The CTGANSynthesizer is part of the SDV library. It provides an end-to-end workflow for creating synthetic data, which includes:
Pre-processing the data and handling constraints
Running the core ML algorithm (CTGAN) on the preprocessed data
Post-processing synthetic data to the correct format and specifications
Step #2 uses the CTGAN Library, which contains just the core machine learning algorithm (GAN implementation).
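For comparison, here is a minimal sketch of calling the CTGAN library directly. At this level you work with the raw algorithm, so you must list the discrete columns yourself (the column names below are hypothetical):

from ctgan import CTGAN

discrete_columns = ['category', 'status']  # hypothetical column names

ctgan = CTGAN(epochs=300)
ctgan.fit(data, discrete_columns)
synthetic_data = ctgan.sample(1000)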
How do I cite CTGAN?
Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
@inproceedings{ctgan,
title={Modeling Tabular data using Conditional GAN},
author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
booktitle={Advances in Neural Information Processing Systems},
year={2019}
}