DCROverfittingProtection
Coming soon! This metric will be available in an upcoming SDMetrics release.
The DCROverfittingProtection metric measures the distance between your synthetic data and the real data to evaluate how private it is. In particular, it estimates whether the synthetic data is too close to the real data by comparing its distances to the real data against its distances to a holdout, validation set.
In this context, overfitting refers to your synthesizer being overfit on the real training data, so that the synthetic data it creates does not generalize beyond it.
Data Compatibility
Categorical: This metric is defined for discrete, categorical data
Boolean: This metric works on booleans because they are a type of categorical data
Numerical: This metric is defined for numerical data
Datetime: This metric works on datetime data by converting to numerical timestamps
Missing values are supported: this metric treats missing values as a single, separate category value. Any other types of columns (ID, PII, etc.) are ignored.
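As an illustration, here is a hypothetical table (all column names are invented for this example) mixing the supported column types:

```python
import numpy as np
import pandas as pd

# A hypothetical table mixing the column types this metric supports.
# Missing values are treated as a single, separate category.
table = pd.DataFrame({
    'department': ['sales', 'engineering', None, 'sales'],    # categorical
    'is_active': [True, False, True, None],                   # boolean
    'salary': [55000.0, 72000.0, np.nan, 61000.0],            # numerical
    'start_date': pd.to_datetime(
        ['2020-01-15', '2021-06-01', None, '2019-11-30']),    # datetime
    'employee_id': ['E001', 'E002', 'E003', 'E004'],          # ID column (ignored)
})
```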
Score
(best) 1.0: The synthetic data is not overfit to the real data. The synthetic data is never closer to the real data than it is to the validation data.
(worst) 0.0: The synthetic data is completely overfit to the real data. The synthetic data is always closer to the real data than it is to the validation data.
Scores in between indicate that the synthetic data is biased toward being close to the real data. For example, a score of 0.5 means that the synthetic data is 50% more likely to be closer to the real data than to the validation data.
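The exact scoring formula is not spelled out above, but a minimal sketch consistent with the three descriptions (an assumption for illustration, not the confirmed SDMetrics formula) is:

```python
def overfitting_protection_score(p_closer_to_real):
    """A plausible scoring rule consistent with the descriptions above.

    This is an assumption, not the confirmed SDMetrics formula.
    p_closer_to_real is the proportion of synthetic rows whose DCR to
    the real data is smaller than their DCR to the validation data.
    """
    p_closer_to_validation = 1.0 - p_closer_to_real
    score = 1.0 - (p_closer_to_real - p_closer_to_validation)
    return max(0.0, min(1.0, score))

# A 50/50 split gives 1.0 (no overfitting), a 75/25 split gives 0.5,
# and a 100/0 split gives 0.0 (complete overfitting).
```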
How does it work?
This metric uses DCR (distance to closest record) measurements to evaluate the overall distance between the real and synthetic data.
It compares these DCR measurements against the validation set to gauge whether the synthetic data is overfit. Steps:
1. Go through every row in the synthetic data and measure the DCR from the real data (see the DCR Algorithm in the next section).
2. Go through the same rows of synthetic data and measure the DCR from the validation data instead.
3. Now every row of synthetic data has a DCR score for the real data and one for the validation data. Sort the synthetic data into two bins:
Set S_r: The synthetic data rows that are closer to the real data than to the validation data
Set S_v: The remaining synthetic data rows (i.e. those that are NOT closer to the real data than to the validation data)
4. The final score is based on the proportion of synthetic data rows that are closer to the real data than to the validation data (see the sketch below).
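A minimal sketch of steps 3 and 4, assuming the per-row DCR values have already been computed (the function and array names are illustrative):

```python
import numpy as np

def split_and_score(dcr_to_real, dcr_to_validation):
    """Bin synthetic rows by which dataset they are closer to.

    Both inputs hold one DCR value per synthetic row. Returns the
    proportion of rows in set S_r (closer to the real data), from
    which the final score is derived.
    """
    dcr_to_real = np.asarray(dcr_to_real)
    dcr_to_validation = np.asarray(dcr_to_validation)

    in_s_r = dcr_to_real < dcr_to_validation  # set S_r
    # The remaining rows form set S_v.
    return in_s_r.mean()
```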
DCR Algorithm
Given a row of data (r) and an entire dataset (D): We measure the distance between r and every single row inside D. The minimum distance is the DCR between r and D.
Measuring the distance between two rows (r and d): For categorical data we use the Hamming distance [1], while for numerical data we use the absolute difference, standardized by the range of the column.
Loop through every value (j) in these rows and compute the distance between the values. Call these values r_j and d_j. The distance between these values is based on the type of data:
For numerical data: distance(r_j, d_j) = |r_j - d_j| / range_j, where range_j is the range (max - min) of column j
For categorical data (and null values): distance(r_j, d_j) = 0 if r_j = d_j, and 1 otherwise
The overall distance between r and d is the average of all distances.
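Below is a direct, unoptimized sketch of this algorithm. How the column ranges are obtained (here, a caller-supplied dict mapping each numerical column to its max - min) is an assumption for illustration:

```python
import numpy as np
import pandas as pd

def value_distance(r_j, d_j, col_range=None):
    """Distance between two values of the same column."""
    if col_range is not None and not (pd.isna(r_j) or pd.isna(d_j)):
        # Numerical (or datetime converted to a timestamp): absolute
        # difference, standardized by the column's range.
        return abs(r_j - d_j) / col_range
    # Categorical, boolean, or null values: Hamming-style distance.
    if pd.isna(r_j) and pd.isna(d_j):
        return 0.0
    return 0.0 if r_j == d_j else 1.0

def row_distance(r, d, ranges):
    """Average of the per-column distances between rows r and d."""
    return np.mean([value_distance(r[c], d[c], ranges.get(c)) for c in r.index])

def dcr(r, D, ranges):
    """DCR: the minimum distance between row r and every row of dataset D."""
    return min(row_distance(r, d, ranges) for _, d in D.iterrows())
```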
Usage
Access this metric from the single_table module and use the compute_breakdown method.
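Based on the parameters listed below, usage should look roughly like this once the metric is released (the DataFrame and metadata variables are placeholders for your own data):

```python
from sdmetrics.single_table import DCROverfittingProtection

# real_training_table, synthetic_table, and holdout_table are
# pandas.DataFrame placeholders; my_metadata_dict describes the table.
score_breakdown = DCROverfittingProtection.compute_breakdown(
    real_training_data=real_training_table,
    synthetic_data=synthetic_table,
    real_validation_data=holdout_table,
    metadata=my_metadata_dict,
)
```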
Parameters
(required) real_training_data: A pandas.DataFrame object containing the real data that was used to train your synthesizer and create the synthetic data
(required) synthetic_data: A pandas.DataFrame object containing the synthetic data
(required) real_validation_data: A pandas.DataFrame containing a separate, holdout set of real data. This data should not have been used to train your synthesizer or to create the synthetic data. For an accurate score, we recommend that the validation set be about the same size as the training data.
(required) metadata: A metadata dictionary that describes the table of data
num_rows_subsample: An integer containing the number of rows to subsample from the synthetic data when computing this metric.
(default) None: Do not subsample the synthetic data. Use all of the synthetic data to compute the final score.
<int>: Subsample the synthetic data to the given number of rows. The subsample estimates the overall score while improving the computation speed.
num_iterations: An integer representing the number of iterations to complete when computing the metric.
(default) 1: Only perform 1 iteration. Use this when you are computing the metric on the entire synthetic dataset, without any subsampling.
<int>: Perform the given number of iterations. The final score is the average across the iterations. Use this when you are computing the metric on subsamples of the synthetic data.
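Per the parameter descriptions above, subsampling pairs naturally with multiple iterations when the synthetic data is large; for example:

```python
# Estimate the score on 5 random subsamples of 1,000 synthetic rows each;
# the final score is the average across the iterations.
score_breakdown = DCROverfittingProtection.compute_breakdown(
    real_training_data=real_training_table,
    synthetic_data=synthetic_table,
    real_validation_data=holdout_table,
    metadata=my_metadata_dict,
    num_rows_subsample=1000,
    num_iterations=5,
)
```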
The compute_breakdown method returns the overall score, as well as the percent of synthetic data rows that were closer to the training data versus the holdout data.
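As a purely hypothetical illustration (the key names below are invented, not the confirmed return format), the breakdown might look like:

```python
# Hypothetical illustration only; the real key names may differ.
{
    'score': 0.8,
    'synthetic_data_percentages': {
        'closer_to_training': 0.6,  # 60% of synthetic rows closer to training data
        'closer_to_holdout': 0.4,   # 40% closer to the holdout data
    },
}
```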
References
[1] https://en.wikipedia.org/wiki/Hamming_distance