CopulaGANSynthesizer
The Copula GAN Synthesizer uses a mix of classic statistical methods and GAN-based deep learning methods to train a model and generate synthetic data.
This is an experimental synthesizer! Let us know if you're finding the modeling process and synthetic data creation useful.
from sdv.single_table import CopulaGANSynthesizer
synthesizer = CopulaGANSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=10)
Creating a synthesizer
When creating your synthesizer, you are required to pass in a Metadata object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.
synthesizer = CopulaGANSynthesizer(
metadata, # required
enforce_min_max_values=True,
enforce_rounding=False,
numerical_distributions={
'amenities_fee': 'beta',
'checkin_date': 'uniform'
},
epochs=500,
verbose=True
)
Parameter Reference
enforce_min_max_values
: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data
(default) True
The synthetic data will contain numerical values that are within the ranges of the real data.
False
The synthetic data may contain numerical values that are less than or greater than the real data.
enforce_rounding
: Control whether the synthetic data should have the same number of decimal digits as the real data
(default) True
The synthetic data will be rounded to the same number of decimal digits that were observed in the real data
False
The synthetic data may contain more decimal digits than were observed in the real data
locales
: A list of locale strings. Any PII columns will correspond to the locales that you provide.
(default) ['en_US']
Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)
<list>
Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.
For example ["en_US", "fr_CA"].
For all options, see the Faker docs.
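As a rough illustration of the locale format described above, the sketch below checks that each string has a 2-character language code, an underscore, and a 2-character country code. The helper name find_malformed_locales is hypothetical and not part of SDV. It only checks the shape of the string and does not confirm that Faker supports the locale.

```python
import re

# Pattern for the format described above: a 2-character language code,
# an underscore, and a 2-character country code (for example "en_US").
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def find_malformed_locales(locales):
    """Return the entries that do not match the two-part locale format."""
    # Hypothetical helper for illustration; not an SDV function.
    return [loc for loc in locales if not LOCALE_PATTERN.fullmatch(str(loc))]


candidate_locales = ["en_US", "fr_CA", "english", "en-US"]
print(find_malformed_locales(candidate_locales))
```

A check like this can be run before passing the list to the locales parameter, so that formatting mistakes surface before the synthesizer is created.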
numerical_distributions
: Set the distribution shape of any numerical columns that appear in your table. Input this as a dictionary, where each key is the name of a numerical column and each value is the name of a numerical distribution.
numerical_distributions = {
<column name>: 'norm',
<column name>: 'uniform',
...
}
(default) None
Use the default distribution (see default_distribution) for every numerical column.
<dictionary>
Apply the given distribution to each column name. The distribution name should be one of: 'norm', 'beta', 'truncnorm', 'uniform', 'gamma' or 'gaussian_kde'.
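Because the distribution names are plain strings, a misspelling is easy to miss. The sketch below checks a numerical_distributions dictionary against the names listed above. The helper find_unsupported_distributions and the column room_rate are hypothetical and only serve the illustration; neither comes from SDV.

```python
# Distribution names listed in the parameter reference above.
SUPPORTED_DISTRIBUTIONS = {
    "norm", "beta", "truncnorm", "uniform", "gamma", "gaussian_kde",
}


def find_unsupported_distributions(numerical_distributions):
    """Return the {column: name} entries whose name is not in the list."""
    # Hypothetical helper for illustration; not an SDV function.
    return {
        column: name
        for column, name in numerical_distributions.items()
        if name not in SUPPORTED_DISTRIBUTIONS
    }


numerical_distributions = {
    "amenities_fee": "beta",
    "checkin_date": "uniform",
    # Hypothetical column with a misspelled name ('norm' is the listed one).
    "room_rate": "normal",
}
print(find_unsupported_distributions(numerical_distributions))
```

An empty result means every entry uses one of the listed names.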
default_distribution
: Set the distribution shape to use by default for all columns. Input this as a single string.
(default) 'beta'
Model the column as a beta distribution
<distribution_name>
Model the column as the given distribution. The distribution name should be one of: 'norm', 'beta', 'truncnorm', 'uniform', 'gamma' or 'gaussian_kde'.
Setting the distribution to 'gaussian_kde' increases the time it takes to train your synthesizer.
epochs
: Number of times to train the GAN. Each new epoch can improve the model.
(default) 300
Run all the data through the Generator and Discriminator 300 times during training
<number>
Train for a different number of epochs. Note that larger numbers will increase the modeling time.
verbose
: Control whether to print out the results of each epoch. You can use this to track the training time as well as the improvements per epoch.
(default) False
Do not print out any results
True
Print out the Generator and Discriminator loss values per epoch. The loss values indicate how well the GAN is currently performing, with lower values indicating higher quality.
cuda
: Whether to use CUDA, a parallel computing platform that allows you to speed up modeling time using the GPU
(default) True
If available, use CUDA to speed up modeling time. If CUDA is not available, this setting has no effect.
False
Do not use CUDA to speed up modeling time.
get_parameters
Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.
Parameters None
Output A dictionary with the parameter names and the values
synthesizer.get_parameters()
{
'enforce_min_max_values': True,
'enforce_rounding': False,
'epochs': 500,
'verbose': True,
'numerical_distributions': {
'amenities_fee': 'beta',
'checkin_date': 'uniform'
},
...
}
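One way to use the returned dictionary is to see which settings differ from the defaults. The sketch below compares a stand-in dictionary (copied from the sample output above, without the elided entries) against the default values stated in the parameter reference on this page. The helper find_customized_parameters is hypothetical, and the real return value may contain additional keys or represent defaults differently.

```python
# Defaults as stated in the parameter reference above.
DOCUMENTED_DEFAULTS = {
    "enforce_min_max_values": True,
    "enforce_rounding": True,
    "locales": ["en_US"],
    "numerical_distributions": None,
    "default_distribution": "beta",
    "epochs": 300,
    "verbose": False,
    "cuda": True,
}


def find_customized_parameters(parameters):
    """Return the parameters whose values differ from the documented defaults."""
    # Hypothetical helper for illustration; not an SDV function.
    return {
        name: value
        for name, value in parameters.items()
        if name in DOCUMENTED_DEFAULTS and value != DOCUMENTED_DEFAULTS[name]
    }


# Stand-in for the dictionary returned by synthesizer.get_parameters(),
# limited to the keys shown in the sample output above.
parameters = {
    "enforce_min_max_values": True,
    "enforce_rounding": False,
    "epochs": 500,
    "verbose": True,
    "numerical_distributions": {
        "amenities_fee": "beta",
        "checkin_date": "uniform",
    },
}
print(sorted(find_customized_parameters(parameters)))
```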
get_metadata
Use this function to access the metadata object that you have included for the synthesizer
Parameters None
Output A Metadata object
metadata = synthesizer.get_metadata()
Learning from your data
To learn a machine learning model based on your real data, use the fit method.
fit
Parameters
(required)
data
: A pandas DataFrame object containing the real data that the machine learning model will learn from
Output (None)
get_learned_distributions
After fitting this synthesizer, you can access the marginal distributions that were learned to estimate the shape of each column.
Parameters None
Output A dictionary that maps the name of each learned column to the distribution that estimates its shape
synthesizer.get_learned_distributions()
{
'amenities_fee': {
'distribution': 'beta',
'learned_parameters': { 'a': 2.22, 'b': 3.17, 'loc': 0.07, 'scale': 48.5 }
},
'checkin_date': {
...
},
...
}
For more information about the distributions and their parameters, visit the Copulas library.
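The nested dictionary can be flattened into a short report. The sketch below works on a stand-in that contains only the one fully shown entry from the sample output above; the helper summarize_distributions is hypothetical and assumes every entry has the 'distribution' and 'learned_parameters' keys shown there.

```python
# Stand-in for the value returned by synthesizer.get_learned_distributions(),
# limited to the one fully shown entry from the sample output above.
learned_distributions = {
    "amenities_fee": {
        "distribution": "beta",
        "learned_parameters": {"a": 2.22, "b": 3.17, "loc": 0.07, "scale": 48.5},
    },
}


def summarize_distributions(learned):
    """Return one readable line per column: name, distribution, parameters."""
    # Hypothetical helper for illustration; not an SDV function.
    lines = []
    for column, info in learned.items():
        params = ", ".join(
            f"{key}={value}" for key, value in info["learned_parameters"].items()
        )
        lines.append(f"{column}: {info['distribution']} ({params})")
    return lines


for line in summarize_distributions(learned_distributions):
    print(line)
```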
get_loss_values
After fitting, you can access the loss values computed during each epoch for both the Generator and the Discriminator.
Parameters (None)
Output A pandas.DataFrame object containing epoch number, generator loss value and discriminator loss value.
synthesizer.get_loss_values()
Epoch Generator Loss Discriminator Loss
1 1.7863 -0.3639
2 1.5484 0.2260
3 1.3633 -0.0441
...
Saving your synthesizer
Save your trained synthesizer for future use.
save
Use this function to save your trained synthesizer as a Python pickle file.
Parameters
(required)
filepath
: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl
Output (None) The file will be saved at the desired location
synthesizer.save(
filepath='my_synthesizer.pkl'
)
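Since the filepath should end in .pkl, a small guard can add the extension when it is missing. This is a minimal sketch using only the standard library; the helper with_pkl_suffix is hypothetical and not part of SDV.

```python
from pathlib import Path


def with_pkl_suffix(filepath):
    """Return filepath as a string, appending '.pkl' if it is missing."""
    # Hypothetical helper for illustration; not an SDV function.
    path = Path(filepath)
    if path.suffix != ".pkl":
        # Append rather than replace, so a name like 'model.v2' keeps its dot.
        path = path.with_name(path.name + ".pkl")
    return str(path)


print(with_pkl_suffix("my_synthesizer"))
print(with_pkl_suffix("my_synthesizer.pkl"))
```

The returned string can then be passed as the filepath argument of synthesizer.save.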
CopulaGANSynthesizer.load
Use this function to load a trained synthesizer from a Python pickle file
Parameters
(required)
filepath
: A string describing the filepath of your saved synthesizer
Output Your synthesizer, as a CopulaGANSynthesizer object
from sdv.single_table import CopulaGANSynthesizer
synthesizer = CopulaGANSynthesizer.load(
filepath='my_synthesizer.pkl'
)
What's next?
After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.