# CopulaGANSynthesizer

The Copula GAN Synthesizer uses a mix of classic statistical methods and GAN-based deep learning methods to train a model and generate synthetic data.

**This is an experimental synthesizer!** Let us know if you're finding the modeling process and synthetic data creation useful.

## Creating a synthesizer

When creating your synthesizer, you are required to pass in a Metadata object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.

### Parameter Reference

**enforce_min_max_values**: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data.

**enforce_rounding**: Control whether the synthetic data should have the same number of decimal digits as the real data.

**locales**: A list of locale strings. Any PII columns will correspond to the locales that you provide.

**numerical_distributions**: Set the distribution shape of any numerical columns that appear in your table. Input this as a dictionary, where each key is the name of a numerical column and each value is a numerical distribution.

Possible options are:

(default) `None`: Use the default distribution for the column.

One of: `'norm'`, `'beta'`, `'truncnorm'`, `'uniform'`, `'gamma'` or `'gaussian_kde'`

**default_distribution**: Set the distribution shape to use by default for all columns. Input this as a single string.

(default) `'beta'`: Model the column as a beta distribution.

One of: `'norm'`, `'beta'`, `'truncnorm'`, `'uniform'`, `'gamma'` or `'gaussian_kde'`

Setting the distribution to `'gaussian_kde'` increases the time it takes to train your synthesizer.

**epochs**: Number of times to train the GAN. Each new epoch can improve the model.

**verbose**: Control whether to print out the results of each epoch. You can use this to track the training time as well as the improvements per epoch.

**cuda**: Whether to use CUDA, a parallel computing platform that allows you to speed up modeling time using the GPU.

**Looking for more customizations?** Other settings are available to fine-tune the architecture of the underlying GAN used to model the data.
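As a rough illustration of the options above, the sketch below builds a customized synthesizer. The helper names `check_distributions` and `build_copula_gan` and the column names `amenities_fee` and `room_rate` are assumptions made for this example, and the import path `sdv.single_table` assumes a recent SDV release. The values shown are examples, not recommended settings.

```python
# Distribution names listed in the parameter reference above.
SUPPORTED_DISTRIBUTIONS = {
    'norm', 'beta', 'truncnorm', 'uniform', 'gamma', 'gaussian_kde',
}


def check_distributions(numerical_distributions):
    """Raise ValueError if a column is mapped to an unsupported distribution name."""
    for column, name in numerical_distributions.items():
        if name not in SUPPORTED_DISTRIBUTIONS:
            raise ValueError(
                f'Unsupported distribution {name!r} for column {column!r}'
            )
    return numerical_distributions


def build_copula_gan(metadata):
    """Create a customized, untrained synthesizer from your Metadata object."""
    # Imported inside the function so the sketch can be read without SDV installed.
    from sdv.single_table import CopulaGANSynthesizer

    return CopulaGANSynthesizer(
        metadata,  # required first argument
        enforce_min_max_values=True,
        enforce_rounding=False,
        numerical_distributions=check_distributions({
            'amenities_fee': 'beta',  # hypothetical column name
            'room_rate': 'gamma',     # hypothetical column name
        }),
        default_distribution='norm',
        epochs=500,
        verbose=True,
    )
```

Calling `build_copula_gan(metadata)` with your own Metadata object should return an untrained synthesizer that is ready for `fit`.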

### get_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

**Parameters** None

**Output** A dictionary with the parameter names and the values

The returned parameters are a copy. Changing them will not affect the synthesizer.
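Because the output is a plain dictionary, one way to see what you customized is to compare it with the parameters of a synthesizer created with no optional arguments. The helper `changed_parameters` is hypothetical, and the dictionaries below are hand-written stand-ins with illustrative values, not real `get_parameters` output.

```python
def changed_parameters(custom_params, default_params):
    """Return the entries of custom_params whose values differ from default_params."""
    return {
        name: value
        for name, value in custom_params.items()
        if name not in default_params or default_params[name] != value
    }


# Hand-written dictionaries standing in for two get_parameters() results.
defaults = {'epochs': 300, 'verbose': False}  # illustrative values only
custom = {'epochs': 500, 'verbose': False}

print(changed_parameters(custom, defaults))
```

With real objects, this would look like `changed_parameters(synthesizer.get_parameters(), baseline.get_parameters())`, where `baseline` is a synthesizer created with only the metadata.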

### get_metadata

Use this function to access the metadata object that you have included for the synthesizer.

**Parameters** None

**Output** A Metadata object

The returned metadata is a copy. Changing it will not affect the synthesizer.

## Learning from your data

To train a machine learning model on your real data, use the `fit` method.

### fit

**Parameters**

(required) `data`: A pandas DataFrame object containing the real data that the machine learning model will learn from

**Output** (None)

**Technical Details:** This synthesizer learns the marginal distributions of the real data columns and normalizes them. Then, it uses CTGAN to learn the normalized data. This takes place in two stages.

**Statistical Learning**: The synthesizer learns the distribution (shape) of each individual column, also known as the 1D or marginal distribution, for example a beta distribution with α=2 and β=5. The synthesizer uses the learned distribution to normalize the values, creating normal curves with µ=0 and σ=1. The Synthetic Data Vault paper has more information about the Gaussian normalization process.

**GAN-based Learning**: This synthesizer uses CTGAN to model the normalized data. The CTGAN uses generative adversarial networks (GANs) to model data, as described in the Modeling Tabular Data using Conditional GAN paper, which was presented at the NeurIPS conference in 2019.
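The Gaussian normalization idea in the statistical learning stage can be pictured with a small, self-contained sketch. It assumes a uniform marginal (chosen only because its CDF has a simple closed form) and uses Python's standard library rather than the code SDV actually runs. The function names are made up for this example.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def gaussian_normalize(values, low, high):
    """Map values with a uniform(low, high) marginal onto a standard normal curve."""
    eps = 1e-6  # keep CDF values strictly inside (0, 1) so inv_cdf is defined
    normalized = []
    for x in values:
        u = (x - low) / (high - low)      # CDF of the uniform marginal
        u = min(max(u, eps), 1.0 - eps)
        normalized.append(STANDARD_NORMAL.inv_cdf(u))
    return normalized


def reverse_normalize(normalized, low, high):
    """Map standard normal values back to the original uniform(low, high) scale."""
    return [low + STANDARD_NORMAL.cdf(z) * (high - low) for z in normalized]


z_scores = gaussian_normalize([2.0, 5.0, 8.0], low=0.0, high=10.0)
print(z_scores)                                # the midpoint 5.0 maps to 0.0
print(reverse_normalize(z_scores, 0.0, 10.0))  # approximately the original values
```

In this sketch the GAN would be trained on values like `z_scores`, and sampled values would be passed through the reverse step to return to the original scale.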

### get_learned_distributions

After fitting this synthesizer, you can access the marginal distributions that were learned to estimate the shape of each column.

**Parameters** None

**Output** A dictionary that maps the name of each learned column to the distribution that estimates its shape

For more information about the distributions and their parameters, visit the Copulas library.

Learned parameters are only available for parametric distributions. For example, you will not be able to access learned distributions for the `'gaussian_kde'` technique.

In some cases, the synthesizer may not be able to fit the exact distribution shape you requested, so you may see another distribution shape (e.g. `'truncnorm'` instead of `'beta'`).
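One way to spot such substitutions is to compare what you requested with what was learned. The sketch below uses a hypothetical helper, `find_fallbacks`, and assumes that each entry of the returned dictionary is itself a dictionary with a `'distribution'` key. That layout is an assumption, so check the actual output of `get_learned_distributions` before relying on it. The sample data is hand-written.

```python
def find_fallbacks(requested, learned):
    """Return columns whose learned distribution differs from the requested one."""
    fallbacks = {}
    for column, wanted in requested.items():
        # Assumption: each learned entry looks like {'distribution': 'beta', ...}.
        got = learned.get(column, {}).get('distribution')
        if got is not None and got != wanted:
            fallbacks[column] = (wanted, got)
    return fallbacks


# Hand-written stand-ins for the requested settings and the learned output.
requested = {'amenities_fee': 'beta', 'room_rate': 'gamma'}
learned = {
    'amenities_fee': {'distribution': 'truncnorm'},
    'room_rate': {'distribution': 'gamma'},
}

print(find_fallbacks(requested, learned))
```

With a fitted synthesizer, the second argument would be `synthesizer.get_learned_distributions()`.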

### get_loss_values

After fitting, you can access the loss values computed during each epoch for both the generator and the discriminator.

**Parameters** (None)

**Output** A pandas.DataFrame object containing epoch number, generator loss value and discriminator loss value.
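A quick way to read the loss values is to compare the first and last epochs. The sketch below works on a list of row dictionaries so that it stays self-contained. The helper `loss_trend` is hypothetical, and the column names `'Generator Loss'` and `'Discriminator Loss'` are assumptions that you should check against the columns of your own DataFrame.

```python
def loss_trend(records,
               generator_key='Generator Loss',
               discriminator_key='Discriminator Loss'):
    """Return (first, last) loss pairs for the generator and the discriminator."""
    if not records:
        raise ValueError('No loss values found; fit the synthesizer first.')
    first, last = records[0], records[-1]
    return {
        'generator': (first[generator_key], last[generator_key]),
        'discriminator': (first[discriminator_key], last[discriminator_key]),
    }


# Hand-written rows standing in for real loss values, ordered by epoch.
rows = [
    {'Epoch': 0, 'Generator Loss': 1.2, 'Discriminator Loss': -0.1},
    {'Epoch': 1, 'Generator Loss': 0.8, 'Discriminator Loss': -0.3},
]

print(loss_trend(rows))
```

With a fitted synthesizer, the rows could come from `synthesizer.get_loss_values().to_dict('records')`.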

## Saving your synthesizer

Save your trained synthesizer for future use.

### save

Use this function to save your trained synthesizer as a Python pickle file.

**Parameters**

(required) `filepath`: A string describing the filepath where you want to save your synthesizer. Make sure this ends in `.pkl`

**Output** (None) The file will be saved at the desired location.

### CopulaGANSynthesizer.load

Use this function to load a trained synthesizer from a Python pickle file.

**Parameters**

(required) `filepath`: A string describing the filepath of your saved synthesizer

**Output** Your synthesizer, as a `CopulaGANSynthesizer` object
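Saving and loading can be combined into a simple round trip. The helper `save_and_reload` below is a hypothetical convenience wrapper, not part of the library. It relies only on the `save` method and the class-level `load` function described above.

```python
def save_and_reload(synthesizer, filepath):
    """Save a trained synthesizer, then load it back from the same file."""
    if not filepath.endswith('.pkl'):
        raise ValueError("filepath should end in '.pkl'")
    synthesizer.save(filepath)
    # load is called on the class, e.g. CopulaGANSynthesizer.load(filepath)
    return type(synthesizer).load(filepath)
```

For example, `restored = save_and_reload(synthesizer, 'my_synthesizer.pkl')` should give you back a `CopulaGANSynthesizer` object.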

## What's next?

After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.

**Want to improve your synthesizer?** Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.

For more details, see Customizations.

## FAQs
