DCRBaselineProtection
Coming soon! This metric will be available soon in an upcoming SDMetrics release
The DCRBaselineProtection metric evaluates how private your synthetic data is by measuring its distance from the real data. For a fair measurement, it compares that distance against the distance of randomly generated data, which would provide the best possible privacy protection.
Data Compatibility
Categorical: This metric is defined for discrete, categorical data
Boolean: This metric works on booleans because they are a type of categorical data
Numerical: This metric is defined for numerical data
Datetime: This metric works on datetime data by converting to numerical timestamps
Missing values are supported. This metric considers missing values as a single, separate category value. This metric ignores any other types of columns (ID, PII, etc.)
Score
(best) 1.0: The synthetic data has the highest possible privacy protection. Replacing the synthetic data entirely with random data would not improve the privacy.
(worst) 0.0: The synthetic data has the worst possible privacy protection. Compared to random data, the synthetic data is much closer to the real data.
Scores between 0.0 and 1.0 indicate relative closeness of the synthetic and real data. A score of 0.5 indicates that synthetic data is 50% closer to the real data than random data would be.
How does it work?
This metric uses DCR (distance to closest record) measurements to evaluate the overall distance between real and synthetic data.
It compares this DCR measurement to random data to get a sense for the relative privacy protection. Steps:
1. Go through every row in the synthetic data and measure the DCR from the real data (see DCR Algorithm in next section). At the end, there will be one DCR measurement for each synthetic data row. Store the median as synthetic_data_median.
2. Create random data by uniformly sampling values within the range of the real data. The random data should be the same size as the synthetic data.
3. Repeat step 1 but using the random data in place of the synthetic data. Store the median as random_data_median.
4. Compute the final score: score = min(synthetic_data_median / random_data_median, 1.0)
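As a rough illustration of the random-data step and the final score step, the sketch below samples random rows uniformly within the range of the real data and turns the two medians into a score. It uses only the Python standard library. The helper names (sample_random_rows, score_from_medians), the list-of-tuples row layout, the rule for telling numerical from categorical columns, and the choice to cap the ratio of medians at 1.0 are assumptions made for this example, not the SDMetrics implementation.

```python
import random


def sample_random_rows(real_rows, num_rows, seed=None):
    """Sample rows uniformly within the range of each real column.

    Assumption: a column whose non-missing values are all int/float is
    numerical; every other column is treated as categorical.
    """
    rng = random.Random(seed)
    samplers = []
    for column in zip(*real_rows):
        observed = [v for v in column if v is not None]
        is_numerical = bool(observed) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in observed
        )
        if is_numerical:
            low, high = min(observed), max(observed)
            samplers.append(lambda low=low, high=high: rng.uniform(low, high))
        else:
            # Missing values (None) count as one more category.
            categories = sorted(set(column), key=repr)
            samplers.append(lambda categories=categories: rng.choice(categories))
    return [tuple(draw() for draw in samplers) for _ in range(num_rows)]


def score_from_medians(synthetic_data_median, random_data_median):
    """Ratio of the two medians, capped at 1.0 (assumed form of the score)."""
    if random_data_median == 0:
        # Assumption: the ratio is undefined when random data sits on top
        # of the real data, so this sketch returns NaN.
        return float("nan")
    return min(synthetic_data_median / random_data_median, 1.0)
```

Under this reading, a synthetic median DCR that is half the random median DCR gives a score of 0.5, and any synthetic median at or above the random median gives 1.0.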
DCR Algorithm
Given a row of data (r) and an entire dataset (D): We measure the distance between r and every single row inside D. The minimum distance is the DCR between r and D.
Measuring the distance between two rows (r and d): For categorical data, we use the Hamming distance [1], while for numerical data we use the absolute difference, standardized to the range of the column.
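Loop through every value (j) in these rows and compute the distance between the values. Call these values r_j and d_j. The distance between these values is based on the type of data.
For numerical data: distance(r_j, d_j) = |r_j - d_j| / range_j, where range_j is the range (maximum minus minimum) of column j
For categorical data (and null values): distance(r_j, d_j) = 0 if r_j = d_j, and 1 otherwise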
The overall distance between r and d is the average of all distances.
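The DCR algorithm described above can be sketched in plain Python as follows. This is a minimal illustration, not the library's code: the function names, the tuple-based rows, and the column_ranges list (a range for each numerical column, None for each categorical column) are hypothetical, and the handling of a zero-width range is an assumption.

```python
from statistics import median


def value_distance(r_j, d_j, column_range=None):
    """Distance between two values from the same column.

    column_range is None for categorical columns, otherwise max - min.
    """
    # Categorical values and nulls: 0 if equal, 1 otherwise (Hamming).
    if column_range is None or r_j is None or d_j is None:
        return 0.0 if r_j == d_j else 1.0
    # Assumption: a constant column falls back to the equality check.
    if column_range == 0:
        return 0.0 if r_j == d_j else 1.0
    # Numerical values: absolute difference standardized by the range.
    return abs(r_j - d_j) / column_range


def row_distance(r, d, column_ranges):
    """Average of the per-value distances between rows r and d."""
    distances = [
        value_distance(r_j, d_j, column_range)
        for r_j, d_j, column_range in zip(r, d, column_ranges)
    ]
    return sum(distances) / len(distances)


def dcr(r, dataset, column_ranges):
    """Distance to closest record: minimum distance from r to any row."""
    return min(row_distance(r, d, column_ranges) for d in dataset)


def median_dcr(rows, dataset, column_ranges):
    """Median DCR of a set of rows measured against dataset."""
    return median(dcr(r, dataset, column_ranges) for r in rows)
```

For example, with real rows (0.0, "a") and (10.0, "b") and column_ranges of [10.0, None], the row (2.0, "a") is at distance 0.1 from the first real row and 0.9 from the second, so its DCR is 0.1.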
Usage
Access this metric from the single_table module and use the compute_breakdown method.
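A call might look like the sketch below. The wrapper function compute_dcr_baseline_protection is hypothetical and exists only for this example. The import path assumes the metric is exposed as DCRBaselineProtection in sdmetrics.single_table, and the keyword arguments mirror the parameters documented on this page. The import is deferred to call time because the metric may not yet be part of your installed SDMetrics release.

```python
def compute_dcr_baseline_protection(real_data, synthetic_data, metadata,
                                    num_rows_subsample=None, num_iterations=1):
    """Hypothetical convenience wrapper around the metric's compute_breakdown."""
    # Assumed import path; requires an SDMetrics release that ships this metric.
    from sdmetrics.single_table import DCRBaselineProtection

    return DCRBaselineProtection.compute_breakdown(
        real_data=real_data,            # pandas.DataFrame with the real data
        synthetic_data=synthetic_data,  # pandas.DataFrame with the synthetic data
        metadata=metadata,              # metadata dictionary describing the table
        num_rows_subsample=num_rows_subsample,
        num_iterations=num_iterations,
    )
```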
Parameters
(required) real_data: A pandas.DataFrame object containing the real data
(required) synthetic_data: A pandas.DataFrame object containing the synthetic data
(required) metadata: A metadata dictionary that describes the table of data
num_rows_subsample: An integer containing the number of rows to subsample from the synthetic data when computing this metric.
(default) None: Do not subsample the synthetic data. Use all of the synthetic data to compute the final score.
<int>: Subsample the synthetic data to the given number of rows. The subsample will estimate the overall score while improving the computation speed.
num_iterations: An integer representing the number of iterations to complete when computing the metric.
(default) 1: Only perform 1 iteration. Use this when you are computing the metric on the entire synthetic dataset, without any subsampling.
<int>: Perform the given number of iterations. The final score will be the average of the iterations. Use this when you are computing the metric on subsamples of the synthetic data.
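To make the interaction between num_rows_subsample and num_iterations concrete, here is a small standard-library sketch of averaging a score over repeated subsamples. The function averaged_score, its score_fn argument (any callable that maps a list of synthetic rows to a score), and the seed parameter are hypothetical names for this illustration. How the library draws its subsamples is not specified here, so sampling without replacement is an assumption.

```python
import random
from statistics import mean


def averaged_score(synthetic_rows, score_fn, num_rows_subsample=None,
                   num_iterations=1, seed=None):
    """Average score_fn over num_iterations subsamples of the synthetic rows."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_iterations):
        if num_rows_subsample is None or num_rows_subsample >= len(synthetic_rows):
            # No subsampling: score the full synthetic dataset.
            sample = synthetic_rows
        else:
            # Assumption: rows are drawn without replacement.
            sample = rng.sample(synthetic_rows, num_rows_subsample)
        scores.append(score_fn(sample))
    return mean(scores)
```

With no subsampling every iteration sees the same synthetic rows, which is why a single iteration is the suggested default in that case.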
The compute_breakdown method returns the overall score, as well as the median DCR values for the synthetic and random datasets (compared to the real data).
FAQs
References