DCRBaselineProtection
Last updated
The DCRBaselineProtection metric measures the distance between your synthetic data and the real data to estimate how private it is. For a fair measurement, it compares this distance against randomly generated data, which would provide the best possible privacy protection.
Categorical: This metric is defined for discrete, categorical data
Boolean: This metric works on booleans because it is a type of categorical data
Numerical: This metric is defined for numerical data
Datetime: This metric works on datetime data by converting to numerical timestamps
Missing values are supported: this metric considers all missing values as a single, separate category. Any other types of columns (ID, PII, etc.) are ignored.
(best) 1.0: The synthetic data has the highest possible privacy protection. Replacing the synthetic data entirely with random data would not improve the privacy.
(worst) 0.0: The synthetic data has the worst possible privacy protection. Compared to random data, the synthetic data is much closer to the real data.
Scores between 0.0 and 1.0 indicate the relative closeness of the synthetic and real data. A score of 0.5 indicates that the synthetic data is 50% closer to the real data than random data would be.
This metric uses DCR (distance to closest record) measurements to evaluate the overall distance between real and synthetic data.
It compares this DCR measurement to random data to get a sense for the relative privacy protection. Steps:
Go through every row in the synthetic data and measure the DCR from the real data (see DCR Algorithm in the next section). At the end, there will be one DCR measurement for each synthetic data row. Store the median as synthetic_data_median.
Create random data by uniformly sampling values within the range of the real data. The random data should be the same size as the synthetic data.
Repeat step 1 but using the random data in place of the synthetic data. Store the median as random_data_median.
Compute the final score:

score = min(synthetic_data_median / random_data_median, 1.0)
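The three steps above can be sketched in Python. This is a minimal illustration using numerical columns only, with hypothetical helper names — it is not the SDMetrics implementation:

```python
import numpy as np

def median_dcr(queries, real, ranges):
    """Median distance-to-closest-record from each query row to the real data.

    Per-value distances are absolute differences standardized by the real
    columns' ranges, then averaged across columns (numerical data only).
    """
    dcrs = []
    for row in queries:
        per_cell = np.abs(real - row) / ranges  # standardized per-value distance
        per_row = per_cell.mean(axis=1)         # average across columns
        dcrs.append(per_row.min())              # distance to the closest record
    return float(np.median(dcrs))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
# Synthetic rows that sit suspiciously close to real rows (poor privacy)
synthetic = real[:100] + rng.normal(scale=0.05, size=(100, 3))

ranges = real.max(axis=0) - real.min(axis=0)

# Step 1: median DCR of the synthetic rows vs. the real data
synthetic_data_median = median_dcr(synthetic, real, ranges)

# Step 2: random baseline, sampled uniformly within the real data's ranges
random_data = rng.uniform(real.min(axis=0), real.max(axis=0), size=synthetic.shape)
random_data_median = median_dcr(random_data, real, ranges)

# Step 3: final score -- 1.0 means random data would not improve privacy
score = min(synthetic_data_median / random_data_median, 1.0)
print(round(score, 3))
```

Because the synthetic rows here are near-copies of real rows, the synthetic median DCR is far smaller than the random baseline's, so the score lands close to 0.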
Given a row of data (r) and an entire dataset (D): We measure the distance between r and every single row inside D. The minimum distance is the DCR between r and D.
Measuring the distance between two rows (r and d): For categorical data, we use the Hamming distance [1] while for numerical data we use the absolute value, standardized to the range of the column.
Loop through every value (j) in these rows and compute the distance between the values. Call these values r_j and d_j. The distance between these values is based on the type of data.

For numerical data, use the absolute difference, standardized to the range of the column in the real data:

distance(r_j, d_j) = |r_j - d_j| / (max_j - min_j)

For categorical data (and null values), use the Hamming distance:

distance(r_j, d_j) = 0 if r_j = d_j, and 1 otherwise
The overall distance between r and d is the average of all distances.
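As a concrete illustration, the per-value distance, row distance, and DCR described above can be written as follows. This is a sketch, not the SDMetrics implementation; the column ranges are taken from the real data:

```python
def value_distance(r_j, d_j, col_range=None):
    """Distance between two values: range-standardized absolute difference
    for numerical columns, Hamming-style 0/1 for categorical (and null) values."""
    if col_range is not None:  # numerical column
        return abs(r_j - d_j) / col_range
    return 0.0 if r_j == d_j else 1.0

def row_distance(r, d, col_ranges):
    """Overall distance between rows r and d: the average of per-value distances."""
    dists = [value_distance(r_j, d_j, col_ranges.get(j))
             for j, (r_j, d_j) in enumerate(zip(r, d))]
    return sum(dists) / len(dists)

def dcr(r, dataset, col_ranges):
    """Distance to closest record: minimum distance from r to any row in dataset."""
    return min(row_distance(r, d, col_ranges) for d in dataset)

# Column 0 is numerical (real-data range 10.0); column 1 is categorical.
col_ranges = {0: 10.0}
real = [(1.0, "A"), (5.0, "B"), (9.0, "A")]
print(dcr((2.0, "A"), real, col_ranges))  # closest row is (1.0, "A"): (0.1 + 0) / 2 = 0.05
```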
Access this metric from the single_table module and use the compute_breakdown method.
Parameters
(required) real_data
: A pandas.DataFrame object containing the real data
(required) synthetic_data
: A pandas.DataFrame object containing the synthetic data
num_rows_subsample
: An integer containing the number of rows to subsample when computing this metric.
(default) None
: Do not subsample the data. Use all of the real and synthetic data to compute the final score.
<int>
: Subsample the real and synthetic data to the given number of rows. The subsample will estimate the overall score while improving the computation speed.
num_iterations
: An integer representing the number of iterations to complete when computing the metric.
(default) 1
: Only perform 1 iteration. Use this when you are computing the metric on the entire synthetic dataset, without any subsampling.
<int>
: Perform the given number of iterations. The final score will be the average of the iterations. Use this when you are computing the metric on subsamples of the synthetic data.
(required) metadata
: A metadata dictionary that describes the table of data

The compute_breakdown method returns the overall score, as well as the median DCR values for the synthetic and random datasets (compared to the real data).
In this case, the DCR method is not an appropriate choice for measuring privacy. We recommend using a different privacy metric instead.
The DCROverfittingProtection metric uses the same DCR computations to measure whether the synthetic data was overfit to the real data. The two metrics are related; however, overfitting protection is specifically designed to measure how well your synthesizer generalized patterns from the real data.
[1] Hamming distance