❖ XGCSynthesizer

❖ SDV Enterprise Bundle. This feature is available as part of the XSynthesizers Bundle, an optional add-on to SDV Enterprise. For more information, please visit the XSynthesizers Bundle page.

XGCSynthesizer stands for eXtraGaussianCopula. Like the GaussianCopulaSynthesizer, it uses classic, statistical methods to train a model and generate synthetic data. However, it contains some additional features for higher quality modeling.

from sdv.single_table import XGCSynthesizer

synthesizer = XGCSynthesizer(metadata)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)

Creating a synthesizer

When creating your synthesizer, you are required to pass in a Metadata object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.

synthesizer = XGCSynthesizer(
    metadata, # required
    enforce_min_max_values=True,
    enforce_rounding=False,
    numerical_distributions={
        'amenities_fee': 'beta',
        'checkin_date': 'scipy.stats.dweibull'
    },
    default_distribution='norm'
)

Parameter Reference

enforce_min_max_values: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data

(default) True

The synthetic data will contain numerical values that are within the ranges of the real data.

False

The synthetic data may contain numerical values that are less than or greater than the real data.
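As a rough illustration of what enforcing min/max boundaries means, the sketch below clips generated values to the range observed in the real data. It is not the synthesizer's internal implementation; the helper name `clip_to_observed_range` and the sample values are hypothetical.

```python
def clip_to_observed_range(real_values, synthetic_values):
    """Clip each synthetic value to the [min, max] range seen in the real data."""
    low, high = min(real_values), max(real_values)
    return [min(max(value, low), high) for value in synthetic_values]


# Hypothetical column values, for illustration only
real_fees = [10.0, 25.5, 48.0]
raw_synthetic_fees = [5.2, 30.1, 51.7]

clipped_fees = clip_to_observed_range(real_fees, raw_synthetic_fees)
print(clipped_fees)  # [10.0, 30.1, 48.0]
```

With enforce_min_max_values=False, values such as 5.2 and 51.7 could appear in the synthetic data unchanged.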

enforce_rounding: Control whether the synthetic data should have the same number of decimal digits as the real data

(default) True

The synthetic data will be rounded to the same number of decimal digits that were observed in the real data.

False

The synthetic data may contain more decimal digits than were observed in the real data.
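To make the rounding behavior concrete, here is a minimal sketch that counts the decimal digits observed in a real column and rounds generated values to match. It assumes the rule is "use the largest number of decimal digits seen in the real data", which may differ from what the synthesizer does internally. The helper names are hypothetical.

```python
from decimal import Decimal


def count_decimal_digits(value):
    """Count digits after the decimal point, based on the float's string form."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(-exponent, 0)


def round_like_real_data(real_values, synthetic_values):
    """Round synthetic values to the most decimal digits seen in the real data."""
    digits = max(count_decimal_digits(value) for value in real_values)
    return [round(value, digits) for value in synthetic_values]


# Hypothetical column values, for illustration only
real_fees = [10.5, 25.25, 48.0]
raw_synthetic_fees = [12.34567, 30.98765]

rounded_fees = round_like_real_data(real_fees, raw_synthetic_fees)
print(rounded_fees)  # [12.35, 30.99]
```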

locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.

(default) ['en_US']

Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)

<list>

Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.

For example ["en_US", "fr_CA"].

For all options, see the Faker docs.
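If you build the locales list programmatically, a quick format check can catch typos before you create the synthesizer. The sketch below only checks the two-part format described above (2-character language, underscore, 2-character country). It does not confirm that Faker supports the locale, and the helper name `has_locale_format` is hypothetical.

```python
import re

# Two lowercase letters, an underscore, two uppercase letters (e.g. "en_US")
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def has_locale_format(locale):
    """Return True if the string looks like '<language>_<COUNTRY>'."""
    return LOCALE_PATTERN.fullmatch(locale) is not None


locales = ["en_US", "fr_CA"]
malformed = [locale for locale in locales if not has_locale_format(locale)]
print(malformed)  # []
```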

numerical_distributions: Set the distribution shape of any numerical columns that appear in your table. Input this as a dictionary, where each key is the name of a numerical column and each value is a numerical distribution.

numerical_distributions = {
    <column name>: 'norm',
    <column name>: 'uniform', 
    ...
}

(default) None

Use the default distribution (see default_distribution) for every numerical column.

<dictionary>

Apply the given distribution to each column name. The distribution name should be one of: 'norm', 'beta', 'truncnorm', 'uniform', 'gamma' or 'gaussian_kde'

'scipy.stats.<distribution_name>'

Use a continuous distribution from the scipy library. Make sure to provide the full path, including the scipy.stats prefix. For example, 'scipy.stats.dweibull' refers to scipy's dweibull distribution.

default_distribution: Set the distribution shape to use by default for all columns. Input this as a single string.

(default) 'beta'

Model the column as a beta distribution

<distribution_name>

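Model the column as the given distribution. The distribution name should be one of: 'norm', 'beta', 'truncnorm', 'uniform', 'gamma' or 'gaussian_kde'

'scipy.stats.<distribution_name>'

Model the column as a continuous distribution from the scipy library, given as a full path that includes the scipy.stats prefix.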

Setting the distribution to 'gaussian_kde' increases the time it takes to train your synthesizer.
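The two parameters work together: a column listed in numerical_distributions uses its own distribution, and every other numerical column falls back to default_distribution. The sketch below expresses that rule in plain Python. The helper names are hypothetical and the validation is only an approximation of the naming rules described above, not the synthesizer's own checks.

```python
SUPPORTED_NAMES = {"norm", "beta", "truncnorm", "uniform", "gamma", "gaussian_kde"}
SCIPY_PREFIX = "scipy.stats."


def is_valid_distribution_name(name):
    """Accept a supported short name or a full 'scipy.stats.<name>' path."""
    has_scipy_path = name.startswith(SCIPY_PREFIX) and len(name) > len(SCIPY_PREFIX)
    return name in SUPPORTED_NAMES or has_scipy_path


def resolve_distribution(column, numerical_distributions=None, default_distribution="beta"):
    """Pick the per-column distribution if given, else the default."""
    chosen = (numerical_distributions or {}).get(column, default_distribution)
    if not is_valid_distribution_name(chosen):
        raise ValueError(f"Unsupported distribution for '{column}': {chosen!r}")
    return chosen


numerical_distributions = {
    "amenities_fee": "beta",
    "checkin_date": "scipy.stats.dweibull",
}

# 'room_rate' is a hypothetical column that is not listed in the dictionary
for column in ["amenities_fee", "checkin_date", "room_rate"]:
    print(column, resolve_distribution(column, numerical_distributions, "norm"))
```

Here the hypothetical room_rate column would be modeled with 'norm', because it has no entry of its own.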

get_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

Parameters None

Output A dictionary with the parameter names and the values

synthesizer.get_parameters()

{
    'enforce_min_max_values': True,
    'enforce_rounding': False,
    'default_distribution': 'norm',
    'numerical_distributions': {
        'amenities_fee': 'beta',
        'checkin_date': 'scipy.stats.dweibull'
    },
    ...
}

The returned parameters are a copy. Changing them will not affect the synthesizer.

get_metadata

Use this function to access the metadata object that you provided when creating the synthesizer.

Parameters None

Output A Metadata object

metadata = synthesizer.get_metadata()

The returned metadata is a copy. Changing it will not affect the synthesizer.

Learning from your data

To train a machine learning model on your real data, use the fit method.

fit

Parameters

(required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from

Output (None)

synthesizer.fit(data)

Technical Details: This synthesizer uses Gaussian Copulas to learn the overall distribution of the real data. This happens in two stages:

1. Learning the distribution of each individual column, also known as the marginal distribution. For example, a beta distribution with α=2 and β=5. The synthesizer uses the learned distribution to normalize the values, creating normal curves with µ=0 and σ=1.
2. Learning the covariance of each pair of normalized columns. This is stored as an n x n matrix, where n is the number of columns in the table.

The Synthetic Data Vault paper has more information about the Gaussian normalization process and the Copula estimations.
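The two stages can be sketched with the Python standard library alone. This is a simplified illustration, not the synthesizer's implementation: it assumes a 'norm' marginal for every column (the real default is 'beta'), uses hypothetical column values, and all function names are made up for the example.

```python
from statistics import NormalDist, mean

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def normalize_column(values):
    """Stage 1: fit a marginal, then map each value onto a standard normal scale."""
    marginal = NormalDist.from_samples(values)  # assumption: a 'norm' marginal
    eps = 1e-9  # keep probabilities strictly inside (0, 1) for inv_cdf
    normalized = []
    for value in values:
        probability = min(max(marginal.cdf(value), eps), 1 - eps)
        normalized.append(STANDARD_NORMAL.inv_cdf(probability))
    return normalized


def covariance_matrix(columns):
    """Stage 2: sample covariance for each pair of normalized columns (n x n)."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    means = {name: mean(columns[name]) for name in names}
    matrix = []
    for first in names:
        row = []
        for second in names:
            total = sum(
                (x - means[first]) * (y - means[second])
                for x, y in zip(columns[first], columns[second])
            )
            row.append(total / (n_rows - 1))
        matrix.append(row)
    return matrix


# Hypothetical two-column table, for illustration only
table = {
    "room_rate": [100.0, 120.0, 140.0, 160.0, 180.0],
    "amenities_fee": [10.0, 14.0, 13.0, 19.0, 22.0],
}

normalized = {name: normalize_column(values) for name, values in table.items()}
covariance = covariance_matrix(normalized)
for row in covariance:
    print([round(entry, 3) for entry in row])
```

Because each normalized column has unit variance, the diagonal entries are close to 1 and the off-diagonal entries describe how strongly each pair of columns moves together.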

get_learned_distributions

After fitting this synthesizer, you can access the marginal distributions that were learned to estimate the shape of each column.

Parameters None

Output A dictionary that maps the name of each learned column to the distribution that estimates its shape

synthesizer.get_learned_distributions()

{
    'amenities_fee': {
        'distribution': 'beta',
        'learned_parameters': { 'a': 2.22, 'b': 3.17, 'loc': 0.07, 'scale': 48.5 }
    },
    'checkin_date': { 
        'distribution': 'scipy.stats.dweibull',
        'learned_parameters': {...}
    },
    ...
}

For more information about the distributions and their parameters, visit the Copulas library.

Learned parameters are only available for parametric distributions. For example, you will not be able to access learned distributions for the 'gaussian_kde' technique.

In some cases, the synthesizer may not be able to fit the exact distribution shape you requested, so you may see another distribution shape as a fallback (e.g. 'truncnorm' instead of 'beta').
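The returned dictionary is plain Python data, so you can inspect it directly. The sketch below shows two possible uses: computing the mean implied by learned beta parameters, and spotting columns where a fallback distribution was used. It assumes the parameter names follow scipy's beta convention (shape parameters a and b, plus loc and scale). The helper names and the room_rate entry are hypothetical.

```python
def beta_mean(parameters):
    """Mean of a beta distribution with shapes a, b shifted by loc and stretched by scale."""
    a, b = parameters["a"], parameters["b"]
    return parameters["loc"] + parameters["scale"] * a / (a + b)


def find_fallbacks(requested, learned):
    """Return columns whose learned distribution differs from the requested one."""
    return {
        column: learned[column]["distribution"]
        for column, name in requested.items()
        if column in learned and learned[column]["distribution"] != name
    }


# Shaped like the output of get_learned_distributions(); 'room_rate' is hypothetical
learned = {
    "amenities_fee": {
        "distribution": "beta",
        "learned_parameters": {"a": 2.22, "b": 3.17, "loc": 0.07, "scale": 48.5},
    },
    "room_rate": {
        "distribution": "truncnorm",
        "learned_parameters": {},
    },
}
requested = {"amenities_fee": "beta", "room_rate": "beta"}

fee_mean = beta_mean(learned["amenities_fee"]["learned_parameters"])
print(round(fee_mean, 2))
print(find_fallbacks(requested, learned))
```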

Saving your synthesizer

Save your trained synthesizer for future use.

save

Use this function to save your trained synthesizer as a Python pickle file.

Parameters

(required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl

Output (None) The file will be saved at the desired location

synthesizer.save(
    filepath='my_synthesizer.pkl'
)
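If the filepath comes from user input or configuration, you may want to check the .pkl extension before saving. The helper below is a small, hypothetical convenience and is not part of SDV.

```python
from pathlib import Path


def as_pkl_path(filepath):
    """Return the filepath as a string, or raise if it does not end in .pkl."""
    path = Path(filepath)
    if path.suffix != ".pkl":
        raise ValueError(f"Expected a filepath ending in .pkl, got {filepath!r}")
    return str(path)


checked_path = as_pkl_path("my_synthesizer.pkl")
print(checked_path)  # my_synthesizer.pkl
```

You could then pass the checked value as the filepath argument of save.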

load (utility function)

Use this utility function to load a trained synthesizer from a Python pickle file. After loading your synthesizer, you'll be able to sample synthetic data from it.

Parameters

(required) filepath: A string describing the filepath of your saved synthesizer

Output Your synthesizer object

from sdv.utils import load_synthesizer

synthesizer = load_synthesizer(
    filepath='my_synthesizer.pkl'
)

This utility function works for any SDV synthesizer.

What's next?

After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.

Want to improve your synthesizer? Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.

For more details, see Customizations.

FAQs

What happens if columns don't contain numerical data?

This synthesizer models non-numerical columns, including columns with missing values.

Although the Gaussian Copula algorithm is designed for only numerical data, this synthesizer converts other data types using Reversible Data Transforms (RDTs). To access and modify the transformations, see Advanced Features.
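To give a feel for what a reversible transformation does, the sketch below maps categories to numbers and back again. It is a deliberately simple stand-in and does not reproduce the transformers that RDT actually applies. All names and values are hypothetical.

```python
def fit_category_codes(values):
    """Assign each distinct category a numeric code."""
    categories = sorted(set(values))
    return {category: code for code, category in enumerate(categories)}


def transform(values, codes):
    """Convert categories to numbers that a numerical model can learn from."""
    return [float(codes[value]) for value in values]


def reverse_transform(numbers, codes):
    """Convert numbers back to the nearest known category."""
    categories = {code: category for category, code in codes.items()}
    highest = len(categories) - 1
    return [categories[min(max(round(number), 0), highest)] for number in numbers]


room_types = ["suite", "basic", "deluxe", "basic"]
codes = fit_category_codes(room_types)

numeric = transform(room_types, codes)
print(numeric)  # [2.0, 0.0, 1.0, 0.0]

# Sampled values are rarely exact codes, so the reverse step snaps to the nearest one
print(reverse_transform([2.2, -0.4, 1.1, 0.3], codes))
```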

Why is 'beta' the default distribution & when should I change it?

To create high quality synthetic data, the chosen distribution should be able to match the shape of your data for some set of parameters. (The synthesizer learns and optimizes those parameters.)

We chose 'beta' as the default distribution because it's capable of matching a variety of different shapes. It's also time efficient compared to other distributions like 'gaussian_kde'.

This default is not guaranteed to work on every dataset. Consider changing the default distribution if all your columns have specific characteristics that you want to capture. If you have only a few columns that are highly important to match, then you can set those shapes specifically using the numerical_distributions parameter.

Can I call fit again even if I've previously fit some data?

Yes, even if you've previously fit data, you should be able to call the fit method again.

If you do this, the synthesizer will start over from scratch and fit the new data that you provide it. This is the equivalent of creating a new synthesizer and fitting it with new data.

