DCROverfittingProtection
The DCROverfittingProtection metric measures the distance between your synthetic data and the real training data to gauge how private the synthetic data is. It compares this against the distance to a holdout validation set to determine the relative closeness.
In this context, real training data refers to the data used to train your synthesizer. If your synthesizer was overfit to the real training data, it is creating synthetic data that is too close to that training data. This is bad for privacy.
Data Compatibility
Categorical: This metric is defined for discrete, categorical data
Boolean: This metric works on booleans because it is a type of categorical data
Numerical: This metric is defined for numerical data
Datetime: This metric works on datetime data by converting to numerical timestamps
Missing values are supported: this metric considers missing values as a single, separate category value. Any other types of columns (ID, PII, etc.) are ignored.
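To make these compatibility rules concrete, here is a minimal, hypothetical preprocessing sketch (the prepare_columns helper is illustrative and not part of the SDMetrics API): datetimes become numerical timestamps, and missing values in boolean or categorical columns become their own category.

```python
import pandas as pd

def prepare_columns(df):
    """Hypothetical preprocessing sketch, not the SDMetrics implementation."""
    df = df.copy()
    for column in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[column]):
            # Datetime columns are converted to numerical timestamps (seconds since epoch).
            df[column] = (df[column] - pd.Timestamp(0)).dt.total_seconds()
        elif pd.api.types.is_bool_dtype(df[column]) or not pd.api.types.is_numeric_dtype(df[column]):
            # Boolean and categorical columns: treat missing values as one extra category.
            df[column] = df[column].astype('object').fillna('__MISSING__')
    return df
```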
Score
(best) 1.0: The synthetic data is not overfit to the real training data. The synthetic data is never closer to the real training data than it is to the validation data.
(worst) 0.0: The synthetic data is completely overfit to the real training data. The synthetic data is always closer to the real training data than it is to the validation data.
Scores between 0.0 and 1.0 indicate a bias of the synthetic data towards the real training data. A score of 0.5 means that the synthetic data is 50% more likely to be closer to the real training data than to the validation data.
How does it work?
This metric uses DCR (distance to closest record) measurements to evaluate the overall distance between real and synthetic data.
It compares these DCR measurements against equivalent measurements to the validation set to gauge whether the synthetic data is overfit. Steps:
1. Go through every row in the synthetic data and measure the DCR from the real training data (see the DCR Algorithm in the next section).
2. Go through the same rows of the synthetic data and measure the DCR from the validation data instead.
3. Every row of synthetic data now has two DCR values: one against the real training data and one against the validation data. Sort the synthetic data into two sets:
Set S_r: The synthetic data rows that are closer to the real training data than to the validation data
Set S_v: The remaining synthetic data rows (i.e. those that are NOT closer to the real training data than to the validation data)
4. The final score is based on the proportion of synthetic data rows that are closer to the real training data than to the validation data, as illustrated in the sketch below.
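To illustrate steps 3 and 4, here is a minimal Python sketch. The per-row DCR values are assumed to have been computed already, and the final min(2 * (1 - p), 1.0) mapping is an assumption chosen to match the score descriptions above (all rows closer to training gives 0.0; chance level or better gives 1.0), not a quote of the SDMetrics implementation.

```python
import numpy as np

def overfitting_protection_score(dcr_to_training, dcr_to_validation):
    """Sketch of the scoring step. Inputs are per-row DCR values for the
    synthetic data, measured against the training and validation sets."""
    dcr_to_training = np.asarray(dcr_to_training)
    dcr_to_validation = np.asarray(dcr_to_validation)

    # Set S_r: synthetic rows strictly closer to the real training data.
    closer_to_training = dcr_to_training < dcr_to_validation

    # Proportion of synthetic rows closer to the training data.
    p = closer_to_training.mean()

    # Assumed mapping: p <= 0.5 (chance level) -> 1.0, p == 1.0 -> 0.0.
    return min(2 * (1 - p), 1.0)
```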
DCR Algorithm
Given a row of data (r) and an entire dataset (D): we measure the distance between r and every single row inside D. The minimum of these distances is the DCR between r and D.
Measuring the distance between two rows (r and d): for categorical data we use the Hamming distance [1], while for numerical data we use the absolute difference, standardized by the range of the column.
Loop through every value (j) in these rows and compute the distance between the values; call these values r_j and d_j. The distance between these values is based on the type of data:
For numerical data: distance(r_j, d_j) = |r_j - d_j| / range_j, where range_j is the range (maximum minus minimum) of column j.
For categorical data (and null values): distance(r_j, d_j) = 0 if r_j = d_j, and 1 otherwise.
The overall distance between r and d is the average of all distances.
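The following sketch spells out this algorithm in Python. It assumes the data has already been converted into numerical and categorical columns as described under Data Compatibility; the function names, the ranges dictionary and the numerical_columns set are illustrative, not the SDMetrics implementation.

```python
def row_distance(r, d, ranges, numerical_columns):
    """Average per-value distance between two rows r and d (pandas Series)."""
    distances = []
    for column in r.index:
        r_j, d_j = r[column], d[column]
        if column in numerical_columns:
            # Absolute difference, standardized by the column's range (0 if the range is 0).
            distances.append(abs(r_j - d_j) / ranges[column] if ranges[column] else 0.0)
        else:
            # Hamming-style distance: 0 if the values match, 1 otherwise.
            distances.append(0.0 if r_j == d_j else 1.0)
    return sum(distances) / len(distances)

def dcr(r, dataset, ranges, numerical_columns):
    """Distance to closest record: the minimum distance from row r to any row of dataset."""
    return min(
        row_distance(r, d, ranges, numerical_columns)
        for _, d in dataset.iterrows()
    )
```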
Usage
Access this metric from the single_table module and use the compute_breakdown method, as shown in the example at the end of this section.
Parameters
(required) real_training_data: A pandas.DataFrame object containing the real data that was used to train your synthesizer and create synthetic data
(required) synthetic_data: A pandas.DataFrame object containing the synthetic data
(required) real_validation_data: A pandas.DataFrame containing a separate, holdout set of real data. This data should not have been used to train your synthesizer or create the synthetic data. For an accurate score, we recommend that the size of the validation set be about equal to the size of the training data.
num_rows_subsample: An integer containing the number of rows to subsample from the data when computing this metric.
(default) None: Do not subsample the data. Use all of the data to compute the final score.
<int>: Subsample the datasets to the given number of rows. The subsample will estimate the overall score while improving the computation speed.
num_iterations: An integer representing the number of iterations to complete when computing the metric.
(default) 1: Only perform 1 iteration. Use this when you are computing the metric on the entire synthetic dataset, without any subsampling.
<int>: Perform the given number of iterations. The final score will be the average across the iterations. Use this when you are computing the metric on subsamples of the synthetic data.
The compute_breakdown method returns the overall score, as well as the percentage of synthetic data rows that were closer to the training data versus the holdout data.
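Putting the pieces together, a usage sketch based on the module, method and parameters described above might look like the following; the CSV file names and the subsampling settings are placeholders.

```python
import pandas as pd
from sdmetrics.single_table import DCROverfittingProtection

# Placeholder DataFrames: substitute your own training, synthetic and holdout data.
real_training_data = pd.read_csv('real_training.csv')
synthetic_data = pd.read_csv('synthetic.csv')
real_validation_data = pd.read_csv('real_holdout.csv')

score_breakdown = DCROverfittingProtection.compute_breakdown(
    real_training_data=real_training_data,
    synthetic_data=synthetic_data,
    real_validation_data=real_validation_data,
    num_rows_subsample=5000,   # estimate the score on a subsample for speed
    num_iterations=3,          # average the score over several subsamples
)
print(score_breakdown)
```

The returned breakdown contains the overall score together with the percentages of synthetic rows that were closer to the training data versus the holdout data, as described above.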
References
[1] Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance