Empirical Differential Privacy
Differential privacy is a mathematically rigorous framework that you can use to create private synthetic data. Using our evaluation tool, you can empirically verify the differential privacy that a synthesizer algorithm offers for a dataset.
In the differential privacy setup, we are interested in measuring the impact that 1 row of training data has on the overall parameters that a synthesizer learns. Depending on the synthesizer's exact algorithm, the parameters may not be easily accessible or interpretable. Instead, we can create synthetic data using the synthesizer and assume that the patterns exhibited by the synthetic data reflect the parameters.
Our evaluation setup creates multiple synthesizers:
First, we train a synthesizer on all of the real training data.
Then, we remove a single row of training data and train a new synthesizer on the remaining data.
We can compare the synthetic data that the synthesizers produce. An algorithm with high differential privacy will produce similar synthetic data despite the removal of a row — no matter which row is removed.
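The leave-one-out comparison above can be sketched as follows. This is a minimal illustration only: it uses a toy per-column Gaussian "synthesizer" and a crude column-mean similarity score, neither of which is the evaluation tool's actual synthesizer or metric.

```python
import numpy as np
import pandas as pd

def toy_synthesizer(train: pd.DataFrame, num_rows: int, seed: int = 0) -> pd.DataFrame:
    """Toy stand-in for a synthesizer: fit an independent Gaussian to each column."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        col: rng.normal(train[col].mean(), train[col].std(ddof=0), num_rows)
        for col in train.columns
    })

def similarity(a: pd.DataFrame, b: pd.DataFrame) -> float:
    """Crude similarity in [0, 1]: 1 minus the mean absolute difference of column means."""
    diff = (a.mean() - b.mean()).abs().mean()
    return float(max(0.0, 1.0 - diff))

def empirical_dp_score(data: pd.DataFrame, num_rows_synthetic_data: int = 10_000,
                       num_rows_test: int = 5, test_data_seed: int = 42) -> float:
    rng = np.random.default_rng(test_data_seed)
    baseline = toy_synthesizer(data, num_rows_synthetic_data)  # trained on all rows
    scores = []
    for row_idx in rng.choice(len(data), size=num_rows_test, replace=False):
        held_out = data.drop(index=data.index[row_idx])        # leave one row out
        synthetic = toy_synthesizer(held_out, num_rows_synthetic_data)
        scores.append(similarity(baseline, synthetic))
    return min(scores)  # worst case over all leave-one-out runs

data = pd.DataFrame({"age": np.random.default_rng(0).normal(40, 10, 200)})
score = empirical_dp_score(data)
print(round(score, 3))  # close to 1.0: removing one row barely changes the toy model
```

Note how the score aggregates the worst case: a single highly influential row is enough to pull the result down, which is exactly the property the evaluation is probing.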
Measuring differential privacy may take some time. This empirical measure trains multiple synthesizers. Depending on the synthesizer algorithm, the size of the dataset, and the number of rows you'd like to test, the overall differential privacy measure may take significant time and computing resources. We recommend starting with a smaller dataset and smaller set of test rows.
Parameters:
(required) data
: A pandas.DataFrame containing the real data for training the synthesizer
synthesizer_parameters
: A dictionary with the parameters to pass into the synthesizer. Use this to fine-tune the synthesizer algorithm.
(default) None
: Use the default parameters for the given synthesizer
<dict>
: A dictionary of parameters to use to fine-tune the synthesizer algorithm. The keys represent the parameter names, and the values are the parameter values.
num_rows_synthetic_data
: The number of rows of synthetic data to produce before doing the differential privacy computations. We recommend using a large number of rows to get a stable representation of what the synthesizer has learned.
(default) 1000000
: Create 1 million rows of synthetic data each time we train a synthesizer
num_rows_test
: The number of rows of real data to test in a leave-one-out fashion. Each row represents an iteration of leaving the row out, training a synthesizer on the remaining data, and creating synthetic data. The evaluation tool optimizes the rows to leave out by purposefully choosing rows with outliers and other interesting patterns.
(default) 20
: Choose 20 rows to leave out (1 at a time) and measure differential privacy.
test_data_seed
: A seed to use to deterministically pick the rows to test
(default) None
: Do not set a seed. Different rows may be left out each time you call this evaluation tool.
verbose
: Whether to show progress.
(default) True
: Show a progress bar for each row that is tested
False
: Do not show a progress bar
Returns: A privacy score representing the empirical differential privacy using the synthesizer algorithm for the given dataset. The score ranges from 0 to 1, describing the impact that 1 row of training data has on the synthesizer.
(best) 1.0: The synthesizer offers the best possible differential privacy protection. A single row of training data has no impact on what the synthesizer learns.
(worst) 0.0: The synthesizer offers the worst possible differential privacy protection. A single row of training data has a massive impact on what the synthesizer learns.
In the SDV's setup, we compare the statistical differences between the synthetic datasets using a statistical similarity measure. (In reality, we could use any statistical measure.) We repeat this process many times, leaving out a different row each time. The differential privacy score represents the worst case scenario that we measure when leaving out a row of real data.
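The worst-case aggregation can be illustrated in isolation: given one similarity score per leave-one-out run, the reported score is the lowest one. The per-run values below are made-up numbers for illustration.

```python
# Hypothetical per-run similarity scores, one per left-out row (made-up values)
per_run_scores = [0.97, 0.99, 0.84, 0.95, 0.98]

# The reported differential privacy score is the worst case across all runs
privacy_score = min(per_run_scores)
print(privacy_score)  # 0.84 — driven by the most influential left-out row
```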
Use the measure_differential_privacy tool to empirically measure the differential privacy of a synthesizer algorithm on a dataset. You can supply any synthesizer for evaluation.
(required) metadata
: An object that describes your data
(required) synthesizer_name
: A string with the name of the synthesizer algorithm to use. You can choose from any of the synthesizers that you have access to.
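Putting the documented parameters together, a call might look like the sketch below. The synthesizer name and the call itself are assumptions (this page does not confirm a module path or the available synthesizer names), so the actual invocation is left as a comment; the dictionary simply mirrors the parameters and defaults described above.

```python
import pandas as pd

# Small stand-in for the real training data
data = pd.DataFrame({
    "age": [34, 51, 28, 62],
    "salary": [60_000, 82_000, 45_000, 91_000],
})

kwargs = {
    "data": data,                          # (required) the real training data
    "synthesizer_name": "GaussianCopulaSynthesizer",  # assumption: one available synthesizer name
    "synthesizer_parameters": None,        # default: use the synthesizer's default parameters
    "num_rows_synthetic_data": 1_000_000,  # default: 1 million synthetic rows per trained synthesizer
    "num_rows_test": 20,                   # default: 20 leave-one-out iterations
    "test_data_seed": None,                # default: no seed, row selection may vary per call
    "verbose": True,                       # default: show a progress bar per tested row
}

# score = measure_differential_privacy(metadata=metadata, **kwargs)  # hypothetical call;
# metadata is the required object describing your data
print(sorted(kwargs))
```

Starting with a small dataset and a small num_rows_test, as recommended above, keeps the first runs fast before scaling up.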