Empirical Differential Privacy
Differential privacy is a mathematically rigorous framework that you can use to create private synthetic data. Using our evaluation tool, you can empirically verify the differential privacy that a synthesizer algorithm offers for a dataset.
In the differential privacy setup, we are interested in measuring the impact that 1 row of training data has on the overall parameters that a synthesizer learns. Depending on the synthesizer's exact algorithm, the parameters may not be easily accessible or interpretable. Instead, we can create synthetic data using the synthesizer and assume that the patterns exhibited by the synthetic data reflect the parameters.
Our evaluation setup creates multiple synthesizers:
First, we train a synthesizer on all of the real training data.
Then, we remove a single row of training data and train a new synthesizer on the remaining data.
We can compare the synthetic data that the synthesizers produce. An algorithm with high differential privacy will produce similar synthetic data despite the removal of a row — no matter which row is removed.
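The leave-one-out comparison above can be sketched as follows. This is a minimal illustration only: it uses a toy per-column Gaussian "synthesizer" and a crude column-mean similarity score, neither of which is the evaluation tool's actual synthesizer or metric.

```python
import numpy as np
import pandas as pd

def toy_synthesizer(train: pd.DataFrame, num_rows: int, seed: int = 0) -> pd.DataFrame:
    """Toy stand-in for a synthesizer: fit an independent Gaussian to each column."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        col: rng.normal(train[col].mean(), train[col].std(ddof=0), num_rows)
        for col in train.columns
    })

def similarity(a: pd.DataFrame, b: pd.DataFrame) -> float:
    """Crude similarity in [0, 1]: 1 minus the mean absolute difference of column means."""
    diff = (a.mean() - b.mean()).abs().mean()
    return float(max(0.0, 1.0 - diff))

def empirical_dp_score(data: pd.DataFrame, num_rows_synthetic_data: int = 10_000,
                       num_rows_test: int = 5, test_data_seed: int = 42) -> float:
    rng = np.random.default_rng(test_data_seed)
    baseline = toy_synthesizer(data, num_rows_synthetic_data)  # trained on all rows
    scores = []
    for row_idx in rng.choice(len(data), size=num_rows_test, replace=False):
        held_out = data.drop(index=data.index[row_idx])        # leave one row out
        synthetic = toy_synthesizer(held_out, num_rows_synthetic_data)
        scores.append(similarity(baseline, synthetic))
    return min(scores)  # worst case over all leave-one-out runs

data = pd.DataFrame({"age": np.random.default_rng(0).normal(40, 10, 200)})
score = empirical_dp_score(data)
print(round(score, 3))  # close to 1.0: removing one row barely changes the toy model
```

Note how the score aggregates the worst case: a single highly influential row is enough to pull the result down, which is exactly the property the evaluation is probing.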
Measuring differential privacy may take some time. This empirical measure trains multiple synthesizers. Depending on the synthesizer algorithm, the size of the dataset, and the number of rows you'd like to test, the overall differential privacy measure may take significant time and computing resources. We recommend starting with a smaller dataset and smaller set of test rows.
Parameters:
(required) data
: A pandas.DataFrame containing the real data for training the synthesizer
synthesizer_parameters
: A dictionary with the parameters to pass into the synthesizer. Use this to fine-tune the synthesizer algorithm.
(default) None
: Use the default parameters for the given synthesizer
<dict>
: A dictionary of parameters to use to fine-tune the synthesizer algorithm. The keys represent the parameter names, and the values are the parameter values.
num_rows_synthetic_data
: The number of rows of synthetic data to produce before doing the differential privacy computations. We recommend using a large number of rows to get a stable representation of what the synthesizer has learned.
(default) 1000000
: Create 1 million rows of synthetic data each time we train a synthesizer
num_rows_test
: The number of rows of real data to test in a leave-one-out fashion. Each row represents an iteration of leaving the row out, training a synthesizer on the remaining data, and creating synthetic data. The evaluation tool optimizes the rows to leave out by purposefully choosing rows with outliers and other interesting patterns.
(default) 20
: Choose 20 rows to leave out (1 at a time) and measure differential privacy.
test_data_seed
: A seed to use to deterministically pick the rows to test
(default) None
: Do not set a seed. Different rows may be left out each time you call this evaluation tool.
verbose
: Whether to show progress.
(default) True
: Show a progress bar for each row that is tested
False
: Do not show a progress bar
Returns: A privacy score representing the empirical differential privacy using the synthesizer algorithm for the given dataset. The score ranges from 0 to 1, describing the impact that 1 row of training data has on the synthesizer.
(best) 1.0: The synthesizer offers the best possible differential privacy protection. A single row of training data has no impact on what the synthesizer learns.
(worst) 0.0: The synthesizer offers the worst possible differential privacy protection. A single row of training data has a massive impact on what the synthesizer learns.
In the SDV's setup, we compare the statistical differences between the synthetic datasets using a statistical similarity measure. (In reality, we could use any statistical measure.) We repeat this process many times, leaving out a different row each time. The differential privacy score represents the worst case scenario that we measure when leaving out a row of real data.
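The worst-case aggregation can be illustrated in isolation: given one similarity score per leave-one-out run, the reported score is the lowest one. The per-run values below are made-up numbers for illustration.

```python
# Hypothetical per-run similarity scores, one per left-out row (made-up values)
per_run_scores = [0.97, 0.99, 0.84, 0.95, 0.98]

# The reported differential privacy score is the worst case across all runs
privacy_score = min(per_run_scores)
print(privacy_score)  # 0.84 — driven by the most influential left-out row
```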
Use the measure_differential_privacy tool to empirically measure the differential privacy of a synthesizer algorithm on a dataset. You can supply any synthesizer for evaluation.
(required) metadata
: An object that describes your data
(required) synthesizer_name
: A string with the name of the synthesizer algorithm to use. You can choose from any of the synthesizers that you have access to.
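Putting the documented parameters together, a call might look like the sketch below. The synthesizer name and the call itself are assumptions (this page does not confirm a module path or the available synthesizer names), so the actual invocation is left as a comment; the dictionary simply mirrors the parameters and defaults described above.

```python
import pandas as pd

# Small stand-in for the real training data
data = pd.DataFrame({
    "age": [34, 51, 28, 62],
    "salary": [60_000, 82_000, 45_000, 91_000],
})

kwargs = {
    "data": data,                          # (required) the real training data
    "synthesizer_name": "GaussianCopulaSynthesizer",  # assumption: one available synthesizer name
    "synthesizer_parameters": None,        # default: use the synthesizer's default parameters
    "num_rows_synthetic_data": 1_000_000,  # default: 1 million synthetic rows per trained synthesizer
    "num_rows_test": 20,                   # default: 20 leave-one-out iterations
    "test_data_seed": None,                # default: no seed, row selection may vary per call
    "verbose": True,                       # default: show a progress bar per tested row
}

# score = measure_differential_privacy(metadata=metadata, **kwargs)  # hypothetical call;
# metadata is the required object describing your data
print(sorted(kwargs))
```

Starting with a small dataset and a small num_rows_test, as recommended above, keeps the first runs fast before scaling up.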