# Empirical Differential Privacy

{% hint style="info" %}
❖ **SDV Enterprise Bundle**. This feature is available as part of the **Differential Privacy Bundle**, an optional add-on to SDV Enterprise. For more information, please visit the [Differential Privacy Bundle](https://docs.sdv.dev/sdv/explore/sdv-bundles/differential-privacy) page.
{% endhint %}

[Differential privacy](https://docs.sdv.dev/sdv/explore/sdv-bundles/differential-privacy) is a mathematically rigorous framework that you can use to create private synthetic data. Using our evaluation tool, you can empirically verify the differential privacy that a synthesizer algorithm offers on a dataset.

## How does it work?

In the differential privacy setup, we are interested in measuring the impact that 1 row of training data has on the overall parameters that a synthesizer learns. Depending on the synthesizer's exact algorithm, the parameters may not be easily accessible or interpretable. Instead, we can create synthetic data using the synthesizer and assume that the patterns exhibited by the synthetic data reflect the parameters.

Our evaluation setup creates multiple synthesizers:

* First, we train a synthesizer on all of the real training data.
* Then, we remove a single row of training data and train a new synthesizer on the remaining rows.

We can compare the synthetic data that the synthesizers produce. An algorithm with *high differential privacy* will produce similar synthetic data despite the removal of a row — no matter which row is removed.

<figure><img src="https://1967107441-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FfNxEeZzl9uFiJ4Zf4BRZ%2Fuploads%2FpCUsVpB83aZlEZtSzoTB%2FDP-evaluation-docs-graphic_May%2015%202025.png?alt=media&#x26;token=44028476-50fd-410e-91ae-89042d5473f6" alt=""><figcaption><p>The differential privacy setup measures the effect that 1 row of training data has on the synthesizer's parameters. We proxy the synthesizer's parameters by producing synthetic data instead.</p></figcaption></figure>

In SDV's setup, we compare the statistical differences between the synthetic datasets using the [quality score](https://docs.sdv.dev/sdv/single-table-data/evaluation/data-quality). (In principle, any statistical measure could be used.) We repeat this process many times, leaving out a different row each time. The differential privacy score represents the *worst case scenario* that we measure when leaving out a row of real data.
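The leave-one-out loop above can be sketched conceptually. The snippet below is an illustrative simplification, not SDV's implementation: `train` stands in for fitting a real synthesizer (here it just learns a column mean), and `similarity` stands in for the quality score computed on synthetic data. The worst case across all left-out rows becomes the score.

```python
import statistics

def train(rows):
    """Toy 'synthesizer': its learned parameters are just the column mean."""
    return statistics.mean(rows)

def similarity(params_a, params_b):
    """Crude stand-in for the quality score: 1.0 means identical parameters."""
    return 1.0 / (1.0 + abs(params_a - params_b))

def empirical_dp_score(rows, num_rows_test):
    baseline = train(rows)  # synthesizer trained on all real data
    scores = []
    for i in range(min(num_rows_test, len(rows))):
        held_out = rows[:i] + rows[i + 1:]  # leave one row out
        scores.append(similarity(baseline, train(held_out)))
    return min(scores)  # worst case over all left-out rows

data = [1.0, 2.0, 3.0, 100.0]  # the outlier dominates the worst case
score = empirical_dp_score(data, num_rows_test=4)
```

Note how removing the outlier (`100.0`) changes the learned parameters far more than removing any typical row, which is why the evaluation focuses on the worst case.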

## API & Usage

### Verify your DP synthesizer

If you are using a synthesizer specifically designed to offer differential privacy — such as [DPGCSynthesizer](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/dpgcsynthesizer), or [DPGCFlexSynthesizer](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/dpgcflexsynthesizer) — it's important to verify the privacy that the synthesizer is able to offer on your dataset. Use the `verify_differential_privacy` method on your synthesizer object and pass in the original data you used during fit.

```python
privacy_score = my_dpgc_synthesizer.verify_differential_privacy(
    data=my_dataframe,
    num_rows_synthetic_data=1000000,
    num_rows_test=10,
    test_data_seed=42,
    verbose=True
)
```

{% hint style="warning" %}
**Measuring differential privacy may take some time.** This empirical measure trains multiple synthesizers. Depending on the synthesizer algorithm, the size of the dataset, and the number of rows you'd like to test, the overall differential privacy measure may take significant time and computing resources. We recommend starting with a smaller dataset and smaller set of test rows.
{% endhint %}

**Parameters**:

* (required) `data`: A pandas.DataFrame containing the real data for training the synthesizer
* `num_rows_synthetic_data`: The number of rows of synthetic data to produce before doing the differential privacy computations. We recommend using a large number of rows to get a stable representation of what the synthesizer has learned.
  * (default) `1000000`: Create 1 million rows of synthetic data each time we train a synthesizer
* `num_rows_test`: The number of rows of real data to test in a leave-one-out fashion. Each row represents an iteration of leaving the row out, training a synthesizer on the remaining data, and creating synthetic data. *The evaluation tool optimizes the rows to leave out by purposefully choosing rows with outliers and other interesting patterns.*
  * (default) `20`: Choose 20 rows to leave out (1 at a time) and measure differential privacy.
* `test_data_seed`: A seed to use to deterministically pick the rows to test
  * (default) `None`: Do not set a seed. Different rows may be left out each time you call this evaluation tool
* `verbose`: Whether to show progress.
  * (default) `True`: Show a progress bar for each row that is tested
  * `False`: Do not show a progress bar

**Returns**: A privacy score representing the empirical differential privacy of the synthesizer algorithm on the given dataset. The score ranges from 0 to 1, describing the impact that 1 row of training data has on the synthesizer.

* **(best) 1.0**: The synthesizer offers the best possible differential privacy protection. A single row of training data has no impact on what the synthesizer learns.
* **(worst) 0.0**: The synthesizer offers the worst possible differential privacy protection. A single row of training data has a massive impact on what the synthesizer learns.
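As a sketch of how you might act on the returned score, the helper and the `0.95` threshold below are illustrative choices for this example, not an SDV recommendation:

```python
def meets_privacy_policy(score, threshold=0.95):
    """True if the empirical DP score clears your release threshold (closer to 1.0 is better)."""
    return score >= threshold

# Illustrative value; in practice, use the result of verify_differential_privacy
privacy_score = 0.97

if meets_privacy_policy(privacy_score):
    print(f"Score {privacy_score:.2f}: strong differential privacy; OK to share.")
else:
    print(f"Score {privacy_score:.2f}: individual rows have too much influence; "
          "consider a different synthesizer or settings.")
```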

{% hint style="success" %}
**SDVerified stamp of approval.** After running this function, your synthesizer records that it has been verified.

```python
my_dpgc_synthesizer.is_verified()
```

```python
{
    'differential_privacy_verified': True
}
```

At this point, you can save your synthesizer object as a file and share it with others.

```python
my_dpgc_synthesizer.save('my_dpgc_synthesizer.pkl')
```

{% endhint %}

### Measure DP on any synthesizer

Use the `measure_differential_privacy` tool to empirically measure the differential privacy of any synthesizer algorithm on a dataset. You can supply any [single-table SDV synthesizer](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers) for this evaluation.

```python
from sdv.evaluation.single_table import measure_differential_privacy

privacy_score = measure_differential_privacy(
    data=my_dataframe,
    metadata=my_metadata,
    synthesizer_name='GaussianCopulaSynthesizer',
    synthesizer_parameters={ 'default_distribution': 'norm' },
    num_rows_synthetic_data=1000000,
    num_rows_test=10,
    test_data_seed=42,
    verbose=True
)
```

{% hint style="warning" %}
**Measuring differential privacy may take some time.** This empirical measure trains multiple synthesizers. Depending on the synthesizer algorithm, the size of the dataset, and the number of rows you'd like to test, the overall differential privacy measure may take significant time and computing resources. We recommend starting with a smaller dataset and smaller set of test rows.
{% endhint %}

**Parameters**:

* (required) `data`: A pandas.DataFrame containing the real data for training the synthesizer
* (required) `metadata`: An [SDV Metadata](https://docs.sdv.dev/sdv/concepts/metadata) object that describes your data
* (required) `synthesizer_name`: A string with the name of the synthesizer algorithm to use. You can choose from any of the [single-table SDV synthesizers](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers) that you have access to.
* `synthesizer_parameters`: A dictionary with the parameters to pass into the synthesizer. Use this to fine-tune the synthesizer algorithm.
  * (default) `None`: Use the default parameters for the given synthesizer
  * `<dict>`: A dictionary of parameters to use to fine-tune the synthesizer algorithm. The keys represent the parameter names, and the values are the parameter values.
* `num_rows_synthetic_data`: The number of rows of synthetic data to produce before doing the differential privacy computations. We recommend using a large number of rows to get a stable representation of what the synthesizer has learned.
  * (default) `1000000`: Create 1 million rows of synthetic data each time we train a synthesizer
* `num_rows_test`: The number of rows of real data to test in a leave-one-out fashion. Each row represents an iteration of leaving the row out, training a synthesizer on the remaining data, and creating synthetic data. *The evaluation tool optimizes the rows to leave out by purposefully choosing rows with outliers and other interesting patterns.*
  * (default) `20`: Choose 20 rows to leave out (1 at a time) and measure differential privacy.
* `test_data_seed`: A seed to use to deterministically pick the rows to test
  * (default) `None`: Do not set a seed. Different rows may be left out each time you call this evaluation tool
* `verbose`: Whether to show progress.
  * (default) `True`: Show a progress bar for each row that is tested
  * `False`: Do not show a progress bar

**Returns**: A privacy score representing the empirical differential privacy of the synthesizer algorithm on the given dataset. The score ranges from 0 to 1, describing the impact that 1 row of training data has on the synthesizer.

* **(best) 1.0**: The synthesizer offers the best possible differential privacy protection. A single row of training data has no impact on what the synthesizer learns.
* **(worst) 0.0**: The synthesizer offers the worst possible differential privacy protection. A single row of training data has a massive impact on what the synthesizer learns.
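Because this tool accepts any single-table synthesizer, one natural use is to compare candidate algorithms and keep the most private one. A minimal sketch is below; the score values in the dictionary are placeholders, not measured results, and in practice each would come from a separate `measure_differential_privacy` call:

```python
def pick_most_private(scores):
    """Return the synthesizer name with the highest privacy score (closer to 1.0 is better)."""
    return max(scores, key=scores.get)

# In practice, populate this dict by calling measure_differential_privacy
# once per candidate algorithm; the numbers here are placeholders.
candidate_scores = {
    'GaussianCopulaSynthesizer': 0.91,  # placeholder
    'CTGANSynthesizer': 0.84,           # placeholder
}

best = pick_most_private(candidate_scores)
```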

