
DCRBaselineProtection


The DCRBaselineProtection metric measures the distance between your synthetic data and your real data to assess how private the synthetic data is. For a fair measurement, it compares this distance against the distance for randomly generated data, which would provide the best possible privacy protection.

Data Compatibility

  • Categorical: This metric is defined for discrete, categorical data

  • Boolean: This metric works on booleans because it is a type of categorical data

  • Numerical: This metric is defined for numerical data

  • Datetime: This metric works on datetime data by converting to numerical timestamps

Missing values are supported. This metric treats missing values as a single, separate category value. It ignores any other types of columns (ID, PII, etc.).
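
For example, a datetime column can be reduced to plain numbers along these lines. This is a minimal sketch of the general idea, not necessarily the library's exact conversion:

import pandas as pd

# parse strings into datetime values, then view them as integer timestamps
dates = pd.Series(pd.to_datetime(['2024-01-01', '2024-06-15']))
timestamps = dates.astype('int64')  # nanoseconds since the epoch, i.e. numerical data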

Score

(best) 1.0: The synthetic data has the highest possible privacy protection. Replacing the synthetic data entirely with random data would not improve the privacy.

(worst) 0.0: The synthetic data has the worst possible privacy protection. Compared to random data, the synthetic data is much closer to the real data.

Scores between 0.0 and 1.0 indicate the relative closeness of the synthetic and real data. For example, a score of 0.5 means that the synthetic data's median distance to the real data is only half the distance that random data would have.

How does it work?

This metric uses DCR (distance to closest record) measurements to evaluate the overall distance between real and synthetic data.

It compares this DCR measurement against the equivalent measurement for random data to get a sense of the relative privacy protection. Steps:

  1. Go through every row in the synthetic data and measure the DCR from the real data (see DCR Algorithm in the next section). At the end, there will be one DCR measurement for each row of synthetic data. Store the median of these measurements as synthetic_data_median.

  2. Create random data by uniformly sampling values within the range of the real data. The random data should be the same size as the synthetic data.

  3. Repeat step 1 but using the random data in place of the synthetic data. Store the median as random_data_median.

  4. Compute the final score:

score = MIN\left(1, \frac{synthetic\_data\_median}{random\_data\_median}\right)
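
To make the steps concrete, here is a minimal sketch of the procedure in plain Python. It assumes all-numeric columns and a dcr(row, dataset) helper such as the one produced by make_dcr in the DCR Algorithm sketch below; none of these names come from the SDMetrics internals.

import numpy as np
import pandas as pd

def baseline_protection_score(real, synthetic, dcr):
    # Step 1: one DCR measurement per synthetic row; keep the median
    synthetic_data_median = np.median([dcr(row, real) for _, row in synthetic.iterrows()])

    # Step 2: random data of the same size, sampled uniformly within the real ranges
    # NOTE: this sketch assumes all-numeric columns
    random_data = pd.DataFrame({
        col: np.random.uniform(real[col].min(), real[col].max(), size=len(synthetic))
        for col in real.columns
    })

    # Step 3: repeat step 1 with the random data; keep the median
    random_data_median = np.median([dcr(row, real) for _, row in random_data.iterrows()])

    # Step 4: final score, capped at 1
    return min(1.0, synthetic_data_median / random_data_median)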

DCR Algorithm

Given a row of data (r) and an entire dataset (D): We measure the distance between r and every single row inside D. The minimum distance is the DCR between r and D.

[Figure] For every point of synthetic data (blue), we identify the closest point of real data (black). The DCR (red line) is the distance to that real data point. In this example, there are three columns of data (X, Y and Z); the same DCR metric can be applied to any number of dimensions.

Measuring the distance between two rows (r and d): Loop through every value (j) in these rows and compute the distance between the values, r_j and d_j. The distance depends on the type of data.

For numerical data, we use the absolute difference, standardized to the range of the column:

distance(r_j, d_j) = \frac{|r_j - d_j|}{\text{range of column}}

For categorical data (and null values), we use the Hamming distance [1]:

distance(r_j, d_j) = 0 \text{ IF } r_j == d_j \text{ ELSE } 1

The overall distance between r and d is the average of all the per-value distances:

distance(r, d) = \frac{\sum_j distance(r_j, d_j)}{\#\text{values}}
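
A minimal sketch of these formulas follows, assuming a column_ranges dictionary that maps each numerical column to its range (max minus min) and each categorical column to None. This is an illustrative reimplementation, not SDMetrics' internal code.

import pandas as pd

def value_distance(r_j, d_j, col_range):
    # nulls form their own category: two nulls match, a null and a non-null do not
    if pd.isna(r_j) or pd.isna(d_j):
        return 0.0 if (pd.isna(r_j) and pd.isna(d_j)) else 1.0
    if col_range is not None:  # numerical: absolute difference, scaled by the column range
        return abs(r_j - d_j) / col_range
    return 0.0 if r_j == d_j else 1.0  # categorical: Hamming distance

def make_dcr(column_ranges):
    # bind the column ranges and return a dcr(row, dataset) helper
    def row_distance(r, d):
        # overall distance between two rows: the average of the per-value distances
        return sum(
            value_distance(r[col], d[col], rng) for col, rng in column_ranges.items()
        ) / len(column_ranges)

    def dcr(r, dataset):
        # DCR: the minimum distance between row r and every row in dataset D
        return min(row_distance(r, d) for _, d in dataset.iterrows())

    return dcr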

Usage

Access this metric from the single_table module and use the compute_breakdown method.

from sdmetrics.single_table import DCRBaselineProtection

score = DCRBaselineProtection.compute_breakdown(
    real_data=real_table,
    synthetic_data=synthetic_table,
    metadata=metadata_dictionary
)

Parameters

  • (required) real_data: A pandas.DataFrame object containing the real data

  • (required) synthetic_data: A pandas.DataFrame object containing the synthetic data

  • (required) metadata: A metadata dictionary that describes the table of data

  • num_rows_subsample: An integer containing the number of rows to subsample when computing this metric.

    • (default) None: Do not subsample the data. Use all of the real and synthetic data to compute the final score.

    • <int>: Subsample the real and synthetic data to the given number of rows. The subsample will estimate the overall score while improving the computation speed.

  • num_iterations: An integer representing the number of iterations to complete when computing the metric.

    • (default) 1: Only perform 1 iteration. Use this when you are computing the metric on the entire synthetic dataset, without any subsampling.

    • <int>: Perform the given number of iterations. The final score will be the average of the iterations. Use this when you are computing the metric on subsamples of the synthetic data.

The compute_breakdown method returns the overall score, as well as the median DCR values for the synthetic and random datasets (compared to the real data).

{
  'score': 0.58808,
  'median_DCR_to_real_data': {
    'synthetic_data': 0.12355,
    'random_data_baseline': 0.21009
  }
}
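
The individual values can be read straight out of the returned dictionary. Note that the score above equals MIN(1, 0.12355 / 0.21009) ≈ 0.588, as the formula prescribes.

overall_score = score['score']
synthetic_median = score['median_DCR_to_real_data']['synthetic_data']
random_median = score['median_DCR_to_real_data']['random_data_baseline']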

FAQs

Is there a way to speed up the computation?

This metric may take a long time to compute on larger datasets because the DCR algorithm computes the distance between every pair of rows across the two datasets.

To speed up the computation, use the num_rows_subsample parameter to shorten the data. This will estimate the total score. Use num_iterations to stabilize the score by running the subsample multiple times.
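
For example, a subsampled run might look like this; the particular values (5,000 rows, 3 iterations) are only illustrative and should be tuned to your dataset size.

score = DCRBaselineProtection.compute_breakdown(
    real_data=real_table,
    synthetic_data=synthetic_table,
    metadata=metadata_dictionary,
    num_rows_subsample=5_000,  # estimate the score on 5,000-row subsamples
    num_iterations=3           # average 3 subsampled runs to stabilize the estimate
)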

What is the purpose of a baseline and why is it based on random data?

The distance between your synthetic and real datasets is related to privacy, but its exact value can be based on elements that are not in your control. For example, if your dataset only contains a few columns, there aren't many possibilities for creating synthetic data. Your synthetic and real datasets will be closer than if you had hundreds of columns.

The baseline score is meant to take this into account. It is based on random data because it is the least sensitive data that you could possibly create, representing the best case scenario for data privacy.

What can I do if my score is NaN?

This metric will produce a NaN score if the random data and the real data have matching rows. This can happen when there are few possibilities for what the data can be. For example, if there are only 2 categorical columns with 2 categories each, there are only 4 possible rows.

In this case, the DCR method is not an appropriate choice for measuring privacy. We recommend using a different privacy metric instead. Browse additional choices here.

Are there other metrics based on DCR that I can use?

DCROverfittingProtection uses the same DCR computations to measure whether the synthetic data was overfit to the real data. It requires that you have a validation (holdout) dataset that was never used for the creation of the synthetic data. The two metrics are related; however, DCROverfittingProtection is specifically designed to measure how well your synthesizer generalized patterns from the real data.

References

[1] Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance