This metric measures whether each row in the synthetic data is novel, or whether it exactly matches an original row in the real data.
- Categorical: This metric works with discrete, categorical columns
- Boolean: This metric works with boolean columns
- Numerical: This metric works with numerical columns
- Datetime: This metric works with datetime columns
This metric also looks for matches in missing values. It ignores any other columns that may be present in your data.
(best) 1.0: The rows in the synthetic data are all new. There are no matches with the real data.
(worst) 0.0: All the rows in the synthetic data are copies of rows in the real data.
The example below shows synthetic data in which 40% of the rows match the real data. It has a score of 0.60.
This fictitious example shows real and synthetic data related to taxes. About 40% of the synthetic data (the first and third rows) exactly matches rows in the real data. The remaining 60% is novel, so the overall NewRowSynthesis score is 0.6.
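The arithmetic behind this score can be sketched in a few lines of Python. The rows and column meanings below are made up for illustration (they are not the table from the example itself), and plain tuple equality stands in for the metric's full matching rules.

```python
# Hypothetical tax-like rows: (filing_status, dependents, owes_tax).
real_rows = [
    ("single", 0, True),
    ("married", 2, False),
    ("single", 1, True),
    ("married", 0, True),
]
synthetic_rows = [
    ("single", 0, True),    # copy of a real row
    ("married", 3, False),  # novel
    ("single", 1, True),    # copy of a real row
    ("single", 2, False),   # novel
    ("married", 1, True),   # novel
]

real_set = set(real_rows)
matches = sum(row in real_set for row in synthetic_rows)

# The score is the share of synthetic rows that are NOT copies.
score = 1 - matches / len(synthetic_rows)
print(matches, round(score, 2))
```

Two of the five synthetic rows are copies, so the score is 1 - 2/5 = 0.6.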
This metric looks for matching rows between the real and synthetic datasets. To be considered a match, every individual value in the real row must match the corresponding value in the synthetic row. The exact matching criteria depend on the type of data.
Categorical/Boolean Data: The value in the real data must be exactly the same as the value in the synthetic data.
Numerical/Datetime Data: This metric scales every value in the real and synthetic data (x). This is shown in the formula below, where r represents all the values in the real data.
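The formula itself is not reproduced in this text. One reading consistent with the description is min-max scaling against the real data; treat the exact form below as an assumption rather than the metric's documented definition.

```latex
x_{\text{scaled}} = \frac{x - \min(r)}{\max(r) - \min(r)}
```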
Then, we consider a match when the synthetic value is within a % of the real value. The % is a parameter, which is set to 0.01 (1%) by default.
Missing Values: To be considered a match, both real and synthetic values must be missing.
Finally, we compute the proportion of rows in the synthetic data that match a row in the real data. The score is the complement of that proportion, so that 1 is the best score (every synthetic row is new) and 0 is the worst (every synthetic row has a match).
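The matching rules above can be put together in a minimal, stand-alone sketch. It is not the SDMetrics implementation: the function name, the dict-based row format, the min-max scaling, and the choice to compare scaled values by absolute difference are all assumptions made for illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def new_row_synthesis_sketch(real_rows, synthetic_rows, numerical_columns, tolerance=0.01):
    """Return 1 minus the share of synthetic rows that match some real row.

    Rows are dicts keyed by column name. Columns named in numerical_columns
    are compared after scaling; all other columns must be exactly equal.
    """
    # Assumption: min-max bounds are taken from the real values only.
    bounds = {}
    for col in numerical_columns:
        values = [row[col] for row in real_rows if not is_missing(row[col])]
        bounds[col] = (min(values, default=0.0), max(values, default=0.0))

    def scale(col, value):
        low, high = bounds[col]
        span = high - low
        # A constant column has no range to divide by; fall back to the raw offset.
        return (value - low) / span if span else value - low

    def values_match(col, real_value, synthetic_value):
        if is_missing(real_value) or is_missing(synthetic_value):
            # A missing value only matches another missing value.
            return is_missing(real_value) and is_missing(synthetic_value)
        if col in numerical_columns:
            # Assumption: tolerance is an absolute gap between scaled values.
            return abs(scale(col, real_value) - scale(col, synthetic_value)) <= tolerance
        return real_value == synthetic_value

    def rows_match(real_row, synthetic_row):
        return all(values_match(col, real_row[col], synthetic_row[col]) for col in real_row)

    matched = sum(
        any(rows_match(real_row, synthetic_row) for real_row in real_rows)
        for synthetic_row in synthetic_rows
    )
    return 1 - matched / len(synthetic_rows)


real = [
    {"status": "single", "income": 40_000.0},
    {"status": "married", "income": 90_000.0},
    {"status": "single", "income": None},
]
synthetic = [
    {"status": "single", "income": 40_200.0},   # scaled gap of 0.004: match
    {"status": "married", "income": 70_000.0},  # scaled gap of 0.4: novel
    {"status": "single", "income": None},       # both incomes missing: match
    {"status": "married", "income": None},      # no real row is married with a missing income: novel
]
score = new_row_synthesis_sketch(real, synthetic, numerical_columns={"income"})
print(score)
```

Two of the four synthetic rows find a match, so this sketch returns 0.5. Comparing every synthetic row against every real row is quadratic, which is one reason sampling the synthetic data can be useful on large tables.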
Access this metric from the single_table module:
from sdmetrics.single_table import NewRowSynthesis
real_data: A pandas.DataFrame containing real columns
synthetic_data: A similar pandas.DataFrame containing synthetic columns
numerical_match_tolerance: A float >0.0 representing how close two numerical values have to be in order to be considered a match. The default is 0.01, which represents 1%.
synthetic_sample_size: The number of synthetic rows to sample before computing this metric. Use this to speed up the computation if you have a large amount of synthetic data. Note that the final score may be less precise if your sample size is low. Set this to None to skip sampling and use all of the synthetic data in the computation.
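The trade-off behind synthetic_sample_size can be pictured with a small stand-alone sketch. It does not call SDMetrics; the function name, the seed argument, and the use of plain tuple equality for matching are assumptions made for illustration.

```python
import random


def sampled_score(real_rows, synthetic_rows, sample_size=None, seed=0):
    """Score a random subset of the synthetic rows (all rows when sample_size is None)."""
    if sample_size is not None and sample_size < len(synthetic_rows):
        rng = random.Random(seed)
        synthetic_rows = rng.sample(synthetic_rows, sample_size)
    real_set = set(real_rows)
    matched = sum(row in real_set for row in synthetic_rows)
    return 1 - matched / len(synthetic_rows)


real = [(i,) for i in range(100)]
# 1,000 hypothetical synthetic rows, of which the first 100 copy real rows.
synthetic = [(i,) for i in range(1000)]

full = sampled_score(real, synthetic)                   # uses every synthetic row
small = sampled_score(real, synthetic, sample_size=20)  # cheaper, but a noisier estimate
print(full, small)
```

With all rows the score is 0.9. With only 20 sampled rows the result can only move in steps of 0.05, so it may land noticeably above or below that value.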
An imperfect score indicates that there is at least 1 synthetic row exactly matching a real row. This may be completely valid in certain scenarios:
- There aren't many possibilities to begin with. For example, if your dataset contains primarily categorical columns, it may be possible for the synthetic data to cycle through all possibilities relatively quickly. At least a few synthetic rows would contain combinations that match real data.
- You've created a lot of synthetic data. The more synthetic data you create, the more combinations of values you'll create. You'll likely see a few matches between the real and synthetic data completely by chance.
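The first scenario can be made concrete with a small sketch. The columns and values are hypothetical; the point is only that a table with few possible combinations forces some overlap with the real data.

```python
from itertools import product

statuses = ["single", "married", "widowed"]
owes_tax = [True, False]

# Only 3 * 2 = 6 distinct rows can ever exist.
all_combinations = set(product(statuses, owes_tax))

# Hypothetical real data that happens to cover 4 of the 6 combinations.
real_rows = {("single", True), ("single", False), ("married", True), ("widowed", False)}

# A synthetic dataset that cycles through every possibility once.
synthetic_rows = sorted(all_combinations)
matched = sum(row in real_rows for row in synthetic_rows)
score = 1 - matched / len(synthetic_rows)
print(len(all_combinations), matched, round(score, 3))
```

Here four of the six possible rows already appear in the real data, so even a synthesizer that copies nothing scores about 0.333.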
A few matches might actually be a good thing! An adversary won't be able to guess the real rows by identifying which combinations don't appear in the synthetic data.
However, if you are seeing a very low score for a small amount of synthetic data, this may be cause for concern. It suggests that your synthetic data model has failed to generalize from the real data and is repeating it instead.
To improve this score, it's necessary to tweak the model that you are using to generate synthetic data. In particular, the model should be improved to better generalize the data rather than repeating it. The methods for improving it are highly dependent on the model that you are using. For example:
- GAN-based models: If your model is based on a Generative Adversarial Network (GAN), then you may want to consider training for fewer epochs (iterations) or modifying the architecture to prevent mode collapse.
- Statistical models: If your model is based on classical statistics, it may be possible to update the parameters, or add noise to them, to make them more general.
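As one hedged illustration of making statistical parameters more general, the sketch below applies additive (Laplace) smoothing to the category frequencies of a single column. This is a generic technique swapped in for illustration, not a method that the metric or any particular synthesizer prescribes; the function name and the alpha parameter are assumptions.

```python
from collections import Counter


def smoothed_probabilities(values, categories, alpha=1.0):
    """Estimate category probabilities with additive smoothing.

    alpha=0 reproduces the observed frequencies exactly; a larger alpha
    shifts probability toward categories that were rare or never observed.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(categories)
    return {cat: (counts[cat] + alpha) / total for cat in categories}


observed = ["single"] * 8 + ["married"] * 2
categories = ["single", "married", "widowed"]

raw = smoothed_probabilities(observed, categories, alpha=0.0)
smooth = smoothed_probabilities(observed, categories, alpha=1.0)
print(raw)
print(smooth)
```

With alpha=0 the model can never produce "widowed" and closely mirrors the observed data. With alpha=1 every category gets some probability, so sampled rows are less tied to what was seen.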
Note that improving this score may decrease overall synthetic data quality. Please check with your synthetic data provider for the best course of action.