DCROverfittingProtection
The DCROverfittingProtection metric measures the distance between your synthetic data and the real training data to gauge how private the synthetic data is. It compares this against the distance to a holdout validation set to determine the relative closeness.
In this context, real training data refers to the data used to train your synthesizer. If your synthesizer was overfit to the real training data, it is creating synthetic data that is too close to that training data. This is bad for privacy.
Data Compatibility
Categorical: This metric is defined for discrete, categorical data
Boolean: This metric works on booleans because it is a type of categorical data
Numerical: This metric is defined for numerical data
Datetime: This metric works on datetime data by converting to numerical timestamps
Missing values are supported: this metric considers missing values as a single, separate category value. Any other types of columns (ID, PII, etc.) are ignored.
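To make these compatibility rules concrete, here is a minimal, hypothetical preprocessing sketch (the prepare_columns helper is illustrative and not part of the SDMetrics API): datetimes become numerical timestamps, and missing values in boolean or categorical columns become their own category.

```python
import pandas as pd

def prepare_columns(df):
    """Hypothetical preprocessing sketch, not the SDMetrics implementation."""
    df = df.copy()
    for column in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[column]):
            # Datetime columns are converted to numerical timestamps (seconds since epoch).
            df[column] = (df[column] - pd.Timestamp(0)).dt.total_seconds()
        elif pd.api.types.is_bool_dtype(df[column]) or not pd.api.types.is_numeric_dtype(df[column]):
            # Boolean and categorical columns: treat missing values as one extra category.
            df[column] = df[column].astype('object').fillna('__MISSING__')
    return df
```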
Score
(best) 1.0: The synthetic data is not overfit to the real training data. The synthetic data is never closer to the real training data than it is to the validation data.
(worst) 0.0: The synthetic data is completely overfit to the real training data. The synthetic data is always closer to the real training data than it is to the validation data.
Scores between 0.0 and 1.0 indicate a bias of the synthetic data towards the real training data. A score of 0.5 means that the synthetic data is 50% more likely to be closer to the real training data than to the validation data.
How does it work?
This metric uses DCR (distance to closest record) measurements to evaluate the overall distance between real and synthetic data.
It compares these DCR measurements against equivalent measurements to the validation set to gauge whether the synthetic data is overfit. Steps:
1. Go through every row in the synthetic data and measure the DCR from the real training data (see the DCR Algorithm in the next section).
2. Go through the same rows of the synthetic data and measure the DCR from the validation data instead.
3. Every row of synthetic data now has two DCR values: one against the real training data and one against the validation data. Sort the synthetic data into two sets:
Set S_r: The synthetic data rows that are closer to the real training data than to the validation data
Set S_v: The remaining synthetic data rows (i.e. those that are NOT closer to the real training data than to the validation data)
4. The final score is based on the proportion of synthetic data rows that are closer to the real training data than to the validation data, as illustrated in the sketch below.
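To illustrate steps 3 and 4, here is a minimal Python sketch. The per-row DCR values are assumed to have been computed already, and the final min(2 * (1 - p), 1.0) mapping is an assumption chosen to match the score descriptions above (all rows closer to training gives 0.0; chance level or better gives 1.0), not a quote of the SDMetrics implementation.

```python
import numpy as np

def overfitting_protection_score(dcr_to_training, dcr_to_validation):
    """Sketch of the scoring step. Inputs are per-row DCR values for the
    synthetic data, measured against the training and validation sets."""
    dcr_to_training = np.asarray(dcr_to_training)
    dcr_to_validation = np.asarray(dcr_to_validation)

    # Set S_r: synthetic rows strictly closer to the real training data.
    closer_to_training = dcr_to_training < dcr_to_validation

    # Proportion of synthetic rows closer to the training data.
    p = closer_to_training.mean()

    # Assumed mapping: p <= 0.5 (chance level) -> 1.0, p == 1.0 -> 0.0.
    return min(2 * (1 - p), 1.0)
```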
DCR Algorithm
Given a row of data (r) and an entire dataset (D): we measure the distance between r and every single row inside D. The minimum of these distances is the DCR between r and D.
Measuring the distance between two rows (r and d): for categorical data we use the Hamming distance [1], while for numerical data we use the absolute difference, standardized by the range of the column.
Loop through every value (j) in these rows and compute the distance between the values; call these values r_j and d_j. The distance between these values is based on the type of data:
For numerical data: distance(r_j, d_j) = |r_j - d_j| / range_j, where range_j is the range (maximum minus minimum) of column j.
For categorical data (and null values): distance(r_j, d_j) = 0 if r_j = d_j, and 1 otherwise.
The overall distance between r and d is the average of all distances.
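The following sketch spells out this algorithm in Python. It assumes the data has already been converted into numerical and categorical columns as described under Data Compatibility; the function names, the ranges dictionary and the numerical_columns set are illustrative, not the SDMetrics implementation.

```python
def row_distance(r, d, ranges, numerical_columns):
    """Average per-value distance between two rows r and d (pandas Series)."""
    distances = []
    for column in r.index:
        r_j, d_j = r[column], d[column]
        if column in numerical_columns:
            # Absolute difference, standardized by the column's range (0 if the range is 0).
            distances.append(abs(r_j - d_j) / ranges[column] if ranges[column] else 0.0)
        else:
            # Hamming-style distance: 0 if the values match, 1 otherwise.
            distances.append(0.0 if r_j == d_j else 1.0)
    return sum(distances) / len(distances)

def dcr(r, dataset, ranges, numerical_columns):
    """Distance to closest record: the minimum distance from row r to any row of dataset."""
    return min(
        row_distance(r, d, ranges, numerical_columns)
        for _, d in dataset.iterrows()
    )
```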
Usage
Access this metric from the single_table module and use the compute_breakdown method, as shown in the example at the end of this section.
Parameters
(required) real_training_data: A pandas.DataFrame object containing the real data that was used to train your synthesizer and create synthetic data
(required) synthetic_data: A pandas.DataFrame object containing the synthetic data
(required) real_validation_data: A pandas.DataFrame containing a separate, holdout set of real data. This data should not have been used to train your synthesizer or create the synthetic data. For an accurate score, we recommend that the size of the validation set be about equal to the size of the training data.
num_rows_subsample: An integer containing the number of rows to subsample from the data when computing this metric.
(default) None: Do not subsample the data. Use all of the data to compute the final score.
<int>: Subsample the datasets to the given number of rows. The subsample will estimate the overall score while improving the computation speed.
num_iterations: An integer representing the number of iterations to complete when computing the metric.
(default) 1: Only perform 1 iteration. Use this when you are computing the metric on the entire synthetic dataset, without any subsampling.
<int>: Perform the given number of iterations. The final score will be the average across the iterations. Use this when you are computing the metric on subsamples of the synthetic data.
The compute_breakdown method returns the overall score, as well as the percentage of synthetic data rows that were closer to the training data versus the holdout data.
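Putting the pieces together, a usage sketch based on the module, method and parameters described above might look like the following; the CSV file names and the subsampling settings are placeholders.

```python
import pandas as pd
from sdmetrics.single_table import DCROverfittingProtection

# Placeholder DataFrames: substitute your own training, synthetic and holdout data.
real_training_data = pd.read_csv('real_training.csv')
synthetic_data = pd.read_csv('synthetic.csv')
real_validation_data = pd.read_csv('real_holdout.csv')

score_breakdown = DCROverfittingProtection.compute_breakdown(
    real_training_data=real_training_data,
    synthetic_data=synthetic_data,
    real_validation_data=real_validation_data,
    num_rows_subsample=5000,   # estimate the score on a subsample for speed
    num_iterations=3,          # average the score over several subsamples
)
print(score_breakdown)
```

The returned breakdown contains the overall score together with the percentages of synthetic rows that were closer to the training data versus the holdout data, as described above.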
References
[1] Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance