NewRowSynthesis

This metric measures whether each row in the synthetic data is novel, or whether it exactly matches an original row in the real data.

Data Compatibility

Categorical: This metric works with discrete, categorical columns
Boolean: This metric works with boolean columns
Numerical: This metric works with numerical columns
Datetime: This metric works with datetime columns

This metric also looks for matches in missing values. It ignores any other columns that may be present in your data.

Score

This metric is in Beta. We recommend caution when interpreting the score.

It may be normal for your synthetic data to contain a row that matches your real data by complete, random chance. In fact, a few matches might actually be a good thing! An adversary won't be able to guess the real rows by identifying which combinations don't appear in the synthetic data.

(best) 1.0: The rows in the synthetic data are all new. There are no matches with the real data.

(worst) 0.0: All the rows in the synthetic data are copies of rows in the real data.

The example below shows synthetic data in which some rows match the real data. It has a score of 0.60, meaning that 40% of the synthetic rows match a real row.

How does it work?

This metric looks for matching rows between the real and synthetic datasets. To be considered a match, every value in the synthetic row must match the corresponding value in the real row. The exact matching criterion depends on the type of data.

Categorical/Boolean Data: The value in the real data must be exactly the same as the value in the synthetic data.

Numerical/Datetime Data: This metric first scales every value (x) in the real and synthetic data using the range of the real data. This is shown in the formula below, where r represents all the values in the real data.

\text{scaled}(x) = \frac{x - \min(r)}{\max(r) - \min(r)}

Then, a synthetic value counts as a match when it is within a given percentage of the real value. This percentage is a parameter (numerical_match_tolerance), which is set to 0.01 (1%) by default.
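The scaling and tolerance check can be sketched in a few lines. This is only an illustration, not the library's implementation: the helper names scale and is_numerical_match are hypothetical, and the sketch assumes that "within 1%" means an absolute difference of at most 0.01 between the two scaled values.

```python
import numpy as np

def scale(x, real_values):
    """Min-max scale x using the minimum and maximum of the real values r."""
    r = np.asarray(real_values, dtype=float)
    return (np.asarray(x, dtype=float) - r.min()) / (r.max() - r.min())

def is_numerical_match(real_value, synthetic_value, real_values, tolerance=0.01):
    # Assumption: the tolerance is applied to the absolute difference
    # between the scaled real value and the scaled synthetic value.
    difference = abs(scale(real_value, real_values) - scale(synthetic_value, real_values))
    return bool(difference <= tolerance)

real_ages = [20, 35, 50, 120]  # max(r) - min(r) = 100

close = is_numerical_match(35, 35.5, real_ages)  # scaled difference is about 0.005
far = is_numerical_match(35, 37, real_ages)      # scaled difference is about 0.02
print(close, far)
```

Under this reading, a larger range in the real data makes the same absolute difference count as a closer match.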

Missing Values: To be considered a match, both real and synthetic values must be missing.

Finally, we compute the proportion of rows in the synthetic data that match a row in the real data. The score is the complement of this proportion, so that 1 is the best score (every synthetic row is new) and 0 is the worst (every synthetic row has a match).

\text{score} = 1 - \frac{\text{matching synthetic rows}}{\text{total synthetic rows}}
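Putting the matching rules and the score formula together, the computation might look like the sketch below. It is a simplified illustration under stated assumptions, not the library's code: new_row_synthesis_sketch is a hypothetical helper, numerical matching is assumed to compare scaled values by absolute difference, and constant numerical columns are not handled.

```python
import pandas as pd

def new_row_synthesis_sketch(real, synthetic, numerical_columns, tolerance=0.01):
    """Return 1 - (matching synthetic rows / total synthetic rows)."""
    real_s, synth_s = real.copy(), synthetic.copy()
    # Scale numerical columns with the min and max of the real data.
    for col in numerical_columns:
        r_min, r_max = real[col].min(), real[col].max()
        real_s[col] = (real[col] - r_min) / (r_max - r_min)
        synth_s[col] = (synthetic[col] - r_min) / (r_max - r_min)

    matching = 0
    for _, s_row in synth_s.iterrows():
        # Start with every real row as a candidate and narrow down per column.
        candidates = pd.Series(True, index=real_s.index)
        for col in real_s.columns:
            r_col, s_val = real_s[col], s_row[col]
            if pd.isna(s_val):
                col_match = r_col.isna()  # both values must be missing
            elif col in numerical_columns:
                col_match = (r_col - s_val).abs() <= tolerance
            else:
                col_match = r_col == s_val  # exact match for categorical/boolean
            candidates &= col_match
        matching += bool(candidates.any())
    return 1 - matching / len(synth_s)

real = pd.DataFrame({
    'age': [20, 40, 60, 120],
    'plan': ['basic', 'pro', 'basic', None],
})
synthetic = pd.DataFrame({
    'age': [20.5, 40, 120, 75, 60],
    'plan': ['basic', 'basic', None, 'pro', 'pro'],
})

score = new_row_synthesis_sketch(real, synthetic, numerical_columns=['age'])
print(score)  # 2 of the 5 synthetic rows match a real row, so the score is 0.6
```

The first synthetic row matches because its age is within the tolerance of a real row with the same plan, and the third matches because both the real and synthetic plan values are missing.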

Usage

Access this metric from the single_table module and use the compute method.

from sdmetrics.single_table import NewRowSynthesis

NewRowSynthesis.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    metadata=single_table_metadata_dict,
    numerical_match_tolerance=0.01,
    synthetic_sample_size=10_000
)

Parameters

(required) real_data: A pandas.DataFrame containing real columns
(required) synthetic_data: A similar pandas.DataFrame containing synthetic columns
(required) metadata: A metadata dictionary describing the columns (see Single Table Metadata)
numerical_match_tolerance: A float >0.0 representing how close two numerical values have to be in order to be considered a match.
- (default) 0.01, which represents 1%
synthetic_sample_size: The number of synthetic rows to sample before computing this metric. Use this to speed up the computation time if you have a large amount of synthetic data. Note that the final score may not be as precise if your sample size is low.
- (default) None: Do not sample. Use all of the synthetic data in the computation.
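To see why a small sample can make the score less precise, consider the toy sketch below. The is_match flags are hypothetical, precomputed values used only for illustration (the metric itself samples rows before looking for matches), and the sampling uses pandas' Series.sample.

```python
import pandas as pd

# Hypothetical flags: True where a synthetic row matches some real row.
is_match = pd.Series([True] * 300 + [False] * 700)

# Score computed from all 1,000 synthetic rows.
full_score = 1 - is_match.mean()

# Score estimated from a random sample of 100 synthetic rows.
sample = is_match.sample(n=100, random_state=0)
sampled_score = 1 - sample.mean()

print(full_score, sampled_score)
```

The sampled score is an estimate of the full score, and it generally gets closer as the sample size grows.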

FAQs

What if I have a numerical column with a constant value?

If your real data has a numerical column that is constant, this metric will treat it as a categorical value and look for exact matches with the value no matter what the numerical_match_tolerance is set to.
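One way to picture this behavior is shown below. This is a sketch based on an assumption, not a description of the library's internals: when a column is constant, max(r) - min(r) is zero, so the scaling formula cannot be applied and an exact comparison is the natural fallback. The helper values_match is hypothetical.

```python
def values_match(real_value, synthetic_value, real_values, tolerance=0.01):
    r_min, r_max = min(real_values), max(real_values)
    if r_max == r_min:
        # Constant column: the scaling denominator would be zero,
        # so fall back to an exact comparison and ignore the tolerance.
        return synthetic_value == real_value
    scaled_real = (real_value - r_min) / (r_max - r_min)
    scaled_synthetic = (synthetic_value - r_min) / (r_max - r_min)
    return abs(scaled_real - scaled_synthetic) <= tolerance

constant_column = [5.0, 5.0, 5.0]

# Even with a very generous tolerance, only the exact value matches.
exact = values_match(5.0, 5.0, constant_column, tolerance=0.5)
near = values_match(5.0, 5.01, constant_column, tolerance=0.5)
print(exact, near)
```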

Should I be concerned if I don't have a perfect score?

An imperfect score indicates that at least one synthetic row matches a real row. This may be completely valid in certain scenarios:

There aren't many possibilities to begin with. For example, if your dataset contains primarily categorical columns, it may be possible for the synthetic data to cycle through all possibilities relatively quickly. At least a few synthetic rows would contain combinations that match real data.
You've created a lot of synthetic data. The more synthetic data you create, the more combinations of values you'll create. You'll likely see a few matches between the real and synthetic data completely by chance.
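A small calculation illustrates the first scenario. The column names and values below are made up, and the sketch assumes a toy synthesizer that picks every possible combination with equal probability, which real models generally do not do.

```python
from itertools import product

# Hypothetical dataset with three categorical/boolean columns.
categories = {
    'plan': ['basic', 'pro'],
    'region': ['north', 'south'],
    'active': [True, False],
}
all_combinations = set(product(*categories.values()))  # 2 * 2 * 2 = 8 rows

# Suppose the real data already contains 6 of the 8 possible combinations.
real_rows = {
    ('basic', 'north', True), ('basic', 'north', False),
    ('basic', 'south', False), ('pro', 'north', True),
    ('pro', 'south', True), ('pro', 'south', False),
}

# If synthetic rows were drawn uniformly from all combinations, this
# fraction of them would match a real row purely by chance.
expected_match_rate = len(real_rows & all_combinations) / len(all_combinations)
expected_score = 1 - expected_match_rate
print(expected_score)  # 0.25
```

In this toy setting, a low score reflects how few distinct rows are possible rather than a model that copies the real data.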

A few matches might actually be a good thing! An adversary won't be able to guess the real rows by identifying which combinations don't appear in the synthetic data.

However, if you are seeing a very low score for a small amount of synthetic data, this may be cause for concern. It suggests that your synthetic data model is repeating real rows rather than generalizing from the real data.

What can I do to improve this score?

To improve this score, it's necessary to tweak the model that you are using to generate synthetic data. In particular, the model should be improved to better generalize the data rather than repeating it. The methods for improving it are highly dependent on the model that you are using. For example:

GAN-based models: If your model is based on a Generative Adversarial Network (GAN), then you may want to consider training for fewer epochs (iterations) or modifying the architecture to prevent mode collapse [1].
Statistical models: If your model is based on classical statistics, it may be possible to update the parameters or add noise to them to make the model more general.

Note that improving this score may decrease overall synthetic data quality. Please check with your synthetic data provider for the best course of action.

References

[1] https://en.wikipedia.org/wiki/Generative_adversarial_network

