# CopulaGANSynthesizer

The Copula GAN Synthesizer uses a mix of classic statistical methods and GAN-based deep learning methods to train a model and generate synthetic data.

{% hint style="warning" %}
**This is an experimental synthesizer!** [Let us know](https://github.com/sdv-dev/SDV/issues) if you're finding the modeling process and synthetic data creation useful.
{% endhint %}

```python
from sdv.single_table import CopulaGANSynthesizer

synthesizer = CopulaGANSynthesizer(metadata)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)
```

## Creating a synthesizer

When creating your synthesizer, you are required to pass in a [Metadata](https://docs.sdv.dev/sdv/single-table-data/data-preparation/creating-metadata) object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.

```python
synthesizer = CopulaGANSynthesizer(
    metadata, # required
    enforce_min_max_values=True,
    enforce_rounding=False,
    numerical_distributions={
        'amenities_fee': 'beta',
        'checkin_date': 'uniform'
    },
    epochs=500,
    verbose=True
)
```

### Parameter Reference

**`enforce_min_max_values`**: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data

<table data-header-hidden><thead><tr><th width="179"></th><th></th></tr></thead><tbody><tr><td>(default) <code>True</code></td><td>The synthetic data will contain numerical values that are within the ranges of the real data.</td></tr><tr><td><code>False</code></td><td>The synthetic data may contain numerical values that are less than or greater than the real data. </td></tr></tbody></table>

**`enforce_rounding`**: Control whether the synthetic data should have the same number of decimal digits as the real data

<table data-header-hidden><thead><tr><th width="179"></th><th></th></tr></thead><tbody><tr><td>(default) <code>True</code></td><td>The synthetic data will be rounded to the same number of decimal digits that were observed in the real data</td></tr><tr><td><code>False</code></td><td>The synthetic data may contain more decimal digits than were observed in the real data</td></tr></tbody></table>

**`locales`**: A list of locale strings. Any PII columns will correspond to the locales that you provide.

<table data-header-hidden><thead><tr><th width="218"></th><th></th></tr></thead><tbody><tr><td>(default) <code>['en_US']</code></td><td>Generate PII values in English corresponding to US-based concepts (eg. addresses, phone numbers, etc.)</td></tr><tr><td><code>&#x3C;list></code></td><td><p>Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.</p><p></p><p>For example <code>[</code><a href="https://faker.readthedocs.io/en/master/locales/en_US.html"><code>"en_US"</code></a><code>,</code> <a href="https://faker.readthedocs.io/en/master/locales/fr_CA.html"><code>"fr_CA"</code></a><code>]</code>. </p><p>For all options, see the <a href="https://faker.readthedocs.io/en/master/locales.html">Faker docs</a>.</p></td></tr></tbody></table>

**`numerical_distributions`**: Set the distribution shape of any numerical columns that appear in your table. Input this as a dictionary, where the key is the name of the numerical column and the value is the name of a distribution.

```python
numerical_distributions = {
    <column name>: 'norm',
    <column name>: 'uniform', 
    ...
}
```

<table data-header-hidden><thead><tr><th width="218"></th><th></th></tr></thead><tbody><tr><td>(default) <code>None</code></td><td>Use the default distribution for the column name.</td></tr><tr><td><code>&#x3C;dictionary></code></td><td>Apply the given distribution to each column name. The distribution name should be one of: <code>'norm'</code>, <code>'beta'</code>, <code>'truncnorm'</code>, <code>'uniform'</code>, <code>'gamma'</code> or <code>'gaussian_kde'</code>.</td></tr></tbody></table>

**`default_distribution`**: Set the distribution shape to use by default for all columns. Input this as a single string.

<table data-header-hidden><thead><tr><th width="204.0625"></th><th></th></tr></thead><tbody><tr><td>(default) <code>'beta'</code></td><td>Model the column as a beta distribution</td></tr><tr><td><code>&#x3C;distribution_name></code></td><td>Model the column as the given distribution. The distribution name should be one of: <code>'norm'</code>, <code>'beta'</code>, <code>'truncnorm'</code>, <code>'uniform'</code>, <code>'gamma'</code> or <code>'gaussian_kde'</code>.</td></tr></tbody></table>

{% hint style="warning" %}
Setting the distribution to `'gaussian_kde'` increases the time it takes to train your synthesizer.
{% endhint %}

**`epochs`**: Number of passes (epochs) the GAN makes through the training data. Each additional epoch can improve the model.

<table data-header-hidden><thead><tr><th width="179"></th><th></th></tr></thead><tbody><tr><td>(default) <code>300</code></td><td>Run all the data through the Generator and Discriminator 300 times during training</td></tr><tr><td><code>&#x3C;number></code></td><td>Train for a different number of epochs. Note that larger numbers will increase the modeling time.</td></tr></tbody></table>

**`verbose`**: Control whether to print out the results of each epoch. You can use this to track the training time as well as the improvements per epoch.

<table data-header-hidden><thead><tr><th width="179"></th><th></th></tr></thead><tbody><tr><td>(default) <code>False</code></td><td>Do not print out any results</td></tr><tr><td><code>True</code></td><td>Print out the Generator and Discriminator loss values per epoch. The loss values indicate how well the GAN is currently performing, lower values indicating higher quality.</td></tr></tbody></table>

**`enable_gpu`**: Whether to enable GPU usage when training the synthesizer. This may speed up the modeling time.

<table data-header-hidden><thead><tr><th width="179"></th><th></th></tr></thead><tbody><tr><td>(default) <code>True</code></td><td>If available, use the GPU to speed up modeling time. Currently, this will look for either <a href="https://developer.nvidia.com/how-to-cuda-python">CUDA</a> (on Linux/Windows machines) or <a href="https://developer.apple.com/metal/pytorch/">MPS</a> (on Mac).  If this is not available, then the GPU will not be used.  </td></tr><tr><td><code>False</code></td><td>Do not use the GPU to speed up modeling time.</td></tr></tbody></table>

{% hint style="info" %}
**Is my synthesizer using the GPU?** After calling `fit`, you may notice that the GPU is not immediately used. This is because the GPU is not used for the initial data preprocessing step. After preprocessing is complete, you should see the GPU being used for the neural network training. For more information about data preprocessing, see [this guide](https://docs.sdv.dev/sdv/single-table-data/modeling/customizations/preprocessing).
{% endhint %}

*(deprecated) `cuda`: Please use the `enable_gpu` option to use CUDA, if it's available on your platform.*

{% hint style="info" %}
**Looking for more customizations?** Other settings are available to fine-tune the architecture of the underlying GAN used to model the data. Click the section below to expand.
{% endhint %}

<details>

<summary>Click to expand additional GAN customization options</summary>

These settings are specific to the GAN. Use these settings if you want to optimize the technical architecture and modeling.

**`batch_size`**: Number of data samples to process in each step. This value must be even, and it must be divisible by the `pac` parameter (see below). Defaults to `500`.&#x20;

**`discriminator_dim`**: Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to `(256, 256)`.

**`discriminator_decay`**: Discriminator weight decay for the Adam Optimizer. Defaults to `1e-6`.

**`discriminator_lr`**: Learning rate for the discriminator. Defaults to `2e-4`.

**`discriminator_steps`**: Number of discriminator updates to do for each generator update. Defaults to `1`, to match the original CTGAN implementation.

**`embedding_dim`**: Size of the random sample passed to the Generator. (Default `128`)

**`generator_decay`**: Generator weight decay for the Adam Optimizer. Defaults to `1e-6`.

**`generator_dim`**: Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to `(256, 256)`.

**`generator_lr`**: Learning rate for the generator. Defaults to `2e-4`.

**`log_frequency`**: Whether to use log frequency of categorical levels in conditional sampling. Defaults to `True`.

**`pac`**: Number of samples to group together when applying the discriminator. Defaults to `10`.

</details>
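The `batch_size` constraint above (even, and divisible by `pac`) can be checked before training. This is a minimal sketch; the `validate_gan_params` helper is hypothetical and not part of SDV:

```python
def validate_gan_params(batch_size: int, pac: int) -> int:
    """Check the CTGAN constraint: batch_size must be even and divisible by pac."""
    if batch_size % 2 != 0:
        raise ValueError('batch_size must be even')
    if batch_size % pac != 0:
        raise ValueError('batch_size must be divisible by pac')
    return batch_size // pac  # number of pac-groups the discriminator sees per batch

print(validate_gan_params(500, 10))  # → 50
```

With the defaults (`batch_size=500`, `pac=10`), each batch is scored as 50 groups of 10 samples.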

### get\_parameters

Use this function to access all the parameters your synthesizer uses -- those you have provided as well as the default ones.

**Parameters** None

**Output** A dictionary with the parameter names and values

```python
synthesizer.get_parameters()
```

```python
{
    'enforce_min_max_values': True,
    'enforce_rounding': False,
    'epochs': 500,
    'verbose': True,
    'numerical_distributions': {
        'amenities_fee': 'beta',
        'checkin_date': 'uniform'
    },
    ...
}
```

{% hint style="info" %}
The returned parameters are a copy. Changing them will not affect the synthesizer.
{% endhint %}

### get\_metadata

Use this function to access the metadata object that you have included for the synthesizer.

**Parameters** None

**Output** A [Metadata](https://docs.sdv.dev/sdv/concepts/metadata) object

```python
metadata = synthesizer.get_metadata()
```

{% hint style="info" %}
The returned metadata is a copy. Changing it will not affect the synthesizer.
{% endhint %}

## Learning from your data

To learn a machine learning model based on your real data, use the `fit` method.

### fit

**Parameters**

* (required) `data`: A [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object containing the real data that the machine learning model will learn from

**Output** (None)

{% hint style="info" %}
**Technical Details:** This synthesizer learns the marginal distributions of the real data columns and normalizes them. Then, it uses CTGAN to learn the normalized data. This takes place in two stages, as shown below.

1. **Statistical Learning**: The synthesizer learns the distribution (shape) of each individual column, also known as the 1D or marginal distribution. For example a beta distribution with α=2 and β=5. The synthesizer uses the learned distribution to normalize the values, creating normal curves with µ=0 and σ=1. The [Synthetic Data Vault paper](https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf) has more information about the Gaussian normalization process.
2. **GAN-based Learning**: This synthesizer uses CTGAN to model the normalized data. CTGAN uses generative adversarial networks (GANs) to model data, as described in the [Modeling Tabular data using Conditional GAN](https://arxiv.org/pdf/1907.00503.pdf) paper, which was presented at the NeurIPS conference in 2019.
{% endhint %}
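The first stage can be sketched with `scipy`: fit a marginal distribution to a column, then map its values through the fitted CDF and the inverse normal CDF. This is an illustrative sketch of the idea on toy data, not SDV's internal code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
column = rng.beta(2, 5, size=1000) * 48.5  # a toy numerical column

# Stage 1: learn the marginal shape, then normalize via CDF -> inverse normal CDF
a, b, loc, scale = stats.beta.fit(column)
u = stats.beta.cdf(column, a, b, loc=loc, scale=scale)
u = np.clip(u, 1e-6, 1 - 1e-6)  # keep away from 0/1 so norm.ppf stays finite
z = stats.norm.ppf(u)

# z is approximately standard normal (mu=0, sigma=1); stage 2 hands z to CTGAN
print(round(float(z.mean()), 1), round(float(z.std()), 1))
```

After sampling, the same mapping is applied in reverse to return values to their original scale.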

### get\_learned\_distributions

After fitting this synthesizer, you can access the marginal distributions that were learned to estimate the shape of each column.

**Parameters** None

**Output** A dictionary that maps the name of each learned column to the distribution that estimates its shape

```python
synthesizer.get_learned_distributions()
```

```
{
    'amenities_fee': {
        'distribution': 'beta',
        'learned_parameters': { 'a': 2.22, 'b': 3.17, 'loc': 0.07, 'scale': 48.5 }
    },
    'checkin_date': { 
        ...
    },
    ...
}
```
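The learned parameters map directly onto `scipy.stats` distributions, so you can inspect or sample from a learned marginal yourself. A sketch using the illustrative `amenities_fee` parameters shown above:

```python
from scipy import stats

params = {'a': 2.22, 'b': 3.17, 'loc': 0.07, 'scale': 48.5}
amenities_fee = stats.beta(**params)

# Expected value implied by the learned shape: loc + scale * a / (a + b)
print(round(amenities_fee.mean(), 2))  # → 20.05
```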

For more information about the distributions and their parameters, visit the [Copulas library](https://sdv.dev/Copulas/).

{% hint style="info" %}
Learned parameters are only available for parametric distributions. For example, you will not be able to access learned distributions for the `'gaussian_kde'` technique.

In some cases, the synthesizer may not be able to fit the exact distribution shape you requested, so you may see another distribution shape (e.g. `'truncnorm'` instead of `'beta'`).
{% endhint %}

### get\_loss\_values

After fitting, you can access the loss values computed during each epoch for both the generator and the discriminator.

**Parameters** (None)

**Output** A `pandas.DataFrame` object containing the epoch number, generator loss, and discriminator loss for each epoch

```python
synthesizer.get_loss_values()
```

```
Epoch  Generator Loss  Discriminator Loss
1      1.7863          -0.3639
2      1.5484          0.2260
3      1.3633          -0.0441
...
```
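Because the output is a regular DataFrame, you can inspect it with standard pandas. A sketch on a hand-built frame that mirrors the structure above:

```python
import pandas as pd

# Mimics the structure returned by get_loss_values()
losses = pd.DataFrame({
    'Epoch': [1, 2, 3],
    'Generator Loss': [1.7863, 1.5484, 1.3633],
    'Discriminator Loss': [-0.3639, 0.2260, -0.0441],
})

# Lower generator loss generally indicates higher quality
best_epoch = int(losses.loc[losses['Generator Loss'].idxmin(), 'Epoch'])
print(best_epoch)  # → 3
```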

## Saving your synthesizer

Save your trained synthesizer for future use.

### save

Use this function to save your trained synthesizer as a Python pickle file.

**Parameters**

* (required) `filepath`: A string describing the filepath where you want to save your synthesizer. Make sure this ends in `.pkl`&#x20;

**Output** (None) The file will be saved at the desired location

```python
synthesizer.save(
    filepath='my_synthesizer.pkl'
)
```

### load (utility function)

Use this utility function to load a trained synthesizer from a Python pickle file. After loading your synthesizer, you'll be able to sample synthetic data from it.

**Parameters**

* (required) `filepath`: A string describing the filepath of your saved synthesizer

**Output** Your synthesizer object

```python
from sdv.utils import load_synthesizer

synthesizer = load_synthesizer(
    filepath='my_synthesizer.pkl'
)
```

*This utility function works for any SDV synthesizer.*
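Saving and loading is a standard Python pickle round trip. The sketch below uses a plain dictionary as a stand-in for a trained synthesizer (an assumption for illustration; a real synthesizer object works the same way):

```python
import os
import pickle
import tempfile

model = {'name': 'CopulaGANSynthesizer', 'fitted': True}  # stand-in object

path = os.path.join(tempfile.mkdtemp(), 'my_synthesizer.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded['fitted'])  # → True
```

As with any pickle file, only load synthesizers from sources you trust.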

## What's next?

After training your synthesizer, you can now sample synthetic data. See the [Sampling](https://docs.sdv.dev/sdv/single-table-data/sampling) section for more details.

```python
synthetic_data = synthesizer.sample(num_rows=10)
```

{% hint style="info" %}
**Want to improve your synthesizer?** Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.

For more details, see [Customizations](https://docs.sdv.dev/sdv/single-table-data/modeling/customizations).
{% endhint %}

## FAQs

<details>

<summary>What happens if columns don't contain numerical data?</summary>

This synthesizer models non-numerical columns, including columns with missing values.

Although the Gaussian Copula algorithm is designed for only numerical data, this synthesizer converts other data types using Reversible Data Transforms (RDTs). To access and modify the transformations, see [Advanced Features](https://docs.sdv.dev/sdv/single-table-data/modeling/customizations).

</details>

<details>

<summary>Why is <code>'beta'</code> the default distribution &#x26; when should I change it?</summary>

To create high quality synthetic data, the distribution should be able to match the shape of data for some optimal set of parameters. (The synthesizer learns and optimizes the parameters.)

We chose `'beta'` as the default distribution because it can take different characteristics based on the parameters, which means it's capable of matching a variety of different shapes. It's also time efficient compared to other distributions like `'gaussian_kde'`.

**This default is not guaranteed to work on every dataset.** Consider changing the default distribution if all your columns have specific characteristics that you want to capture. If you have only a few columns that are highly important to match, then you can set those shapes specifically using the `numerical_distributions` parameter.

</details>

<details>

<summary>Can I call <code>fit</code> again even if I've previously fit some data?</summary>

Yes, even if you've previously fit data, you can call the `fit` method again.

If you do this, the synthesizer will **start over from scratch** and fit the new data that you provide it. This is the equivalent of creating a new synthesizer and fitting it with new data.

</details>

<details>

<summary>How do I cite CopulaGAN?</summary>

The CopulaGAN is a hybrid of the Gaussian Copula and the CTGAN algorithms. The [Gaussian Copula](https://en.wikipedia.org/wiki/Copula_\(probability_theory\)) is a well-known statistical approach. The CTGAN is a research project that you can cite using the following text:

*Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni*. **Modeling Tabular data using Conditional GAN**. NeurIPS, 2019.

```
@inproceedings{ctgan,
   title={Modeling Tabular data using Conditional GAN},
   author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
   booktitle={Advances in Neural Information Processing Systems},
   year={2019}
}
```

</details>

