NewRowSynthesis
This metric measures whether each row in the synthetic data is novel, or whether it exactly matches an original row in the real data.
Categorical: This metric works with discrete, categorical columns
Boolean: This metric works with boolean columns
Numerical: This metric works with numerical columns
Datetime: This metric works with datetime columns
This metric also looks for matches in missing values. It ignores any other columns that may be present in your data.
(best) 1.0: The rows in the synthetic data are all new. There are no matches with the real data.
(worst) 0.0: All the rows in the synthetic data are copies of rows in the real data.
The example below shows synthetic data in which 40% of the rows match a row in the real data. It has a score of 0.60.
This metric looks for matching rows between the real and synthetic datasets. To be considered a match, every value in the synthetic row must match the corresponding value in the real row. The exact matching criteria depend on the type of data.
Categorical/Boolean Data: The value in the real data must be exactly the same as the value in the synthetic data.
Numerical/Datetime Data: This metric scales every value in the real and synthetic data (x). This is shown in the formula below, where r represents all the values in the real data.
Then, we consider it a match when the synthetic value is within a given percentage of the real value. This percentage is a parameter (numerical_match_tolerance), which is set to 0.01 (1%) by default.
Missing Values: To be considered a match, both real and synthetic values must be missing.
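The per-value criteria above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names (is_missing, scale, values_match) are hypothetical, the scaling is assumed to be min-max scaling against the real column's range, datetimes are assumed to be already converted to numbers, and "within a percentage of the real value" is read as a relative difference on the scaled values.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def scale(x, real_values):
    """ASSUMPTION: min-max scaling using the range of the real column."""
    lo, hi = min(real_values), max(real_values)
    if hi == lo:
        return 0.0  # constant column: every value maps to the same point
    return (x - lo) / (hi - lo)


def values_match(real, synthetic, kind, real_values=None, tolerance=0.01):
    """Return True if a single real/synthetic value pair counts as a match."""
    # Missing values: both sides must be missing.
    if is_missing(real) or is_missing(synthetic):
        return is_missing(real) and is_missing(synthetic)
    # Categorical/boolean values: exact equality.
    if kind in ("categorical", "boolean"):
        return real == synthetic
    # Numerical/datetime values: compare after scaling (hypothetical reading).
    r = scale(real, real_values)
    s = scale(synthetic, real_values)
    return abs(s - r) <= tolerance * abs(r)
```

For instance, with a real column ranging from 0.0 to 10.0, a synthetic 9.95 matches a real 10.0 under the default tolerance, while 9.0 does not.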
Finally, we compute the proportion of rows in the synthetic data that match a row in the real data. The score is the complement of this proportion, so that 1 is the best score (no row has a match) and 0 is the worst (every row has a match).
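As a rough sketch of this final step, the snippet below computes the complement of the matching proportion for a small, made-up dataset. The function names (rows_match, new_row_score) are hypothetical, rows are plain tuples with None standing for a missing value, and only exact matching is used, so it does not reproduce the numerical tolerance.

```python
def rows_match(real_row, synthetic_row):
    """Exact match on every value; None == None covers missing values."""
    return real_row == synthetic_row


def new_row_score(real_rows, synthetic_rows, match=rows_match):
    """Return 1 minus the fraction of synthetic rows that match a real row."""
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty")
    matched = sum(
        1 for s in synthetic_rows
        if any(match(r, s) for r in real_rows)
    )
    return 1.0 - matched / len(synthetic_rows)


# Made-up data: 2 of the 5 synthetic rows copy a real row.
real = [("red", True), ("blue", None), ("green", False)]
synthetic = [
    ("red", True),     # matches a real row
    ("blue", None),    # matches, including the missing value
    ("pink", True),
    ("green", True),
    ("teal", False),
]
score = new_row_score(real, synthetic)  # 1 - 2/5
print(round(score, 2))
```

Here two of five synthetic rows have a match, so the matching proportion is 0.4 and the score is its complement.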
Access this metric from the single_table module and use the compute method.
Parameters
(required) real_data: A pandas.DataFrame containing real columns
(required) synthetic_data: A similar pandas.DataFrame containing synthetic columns
(required) metadata: A metadata dictionary describing the columns (see Single Table Metadata)
numerical_match_tolerance: A float >0.0 representing how close two numerical values have to be in order to be considered a match.
(default) 0.01, which represents 1%
synthetic_sample_size: The number of synthetic rows to sample before computing this metric. Use this to speed up the computation time if you have a large amount of synthetic data. Note that the final score may not be as precise if your sample size is low.
(default) None: Do not sample. Use all of the synthetic data in the computation.
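To illustrate the trade-off behind synthetic_sample_size, here is a small standalone sketch that scores a random subset of the synthetic rows. It is not the library's code: sampled_score and its seed argument are hypothetical, and it uses exact matching on hashable tuples only.

```python
import random


def sampled_score(real_rows, synthetic_rows, sample_size=None, seed=None):
    """Score a random subset of the synthetic rows (exact matching only)."""
    rng = random.Random(seed)
    if sample_size is not None and sample_size < len(synthetic_rows):
        synthetic_rows = rng.sample(synthetic_rows, sample_size)
    real_set = set(real_rows)
    matched = sum(1 for row in synthetic_rows if row in real_set)
    return 1.0 - matched / len(synthetic_rows)


# Made-up data: 50 of the 1000 synthetic rows copy a real row.
real = [(i,) for i in range(100)]
synthetic = [(i,) for i in range(50, 1050)]

full = sampled_score(real, synthetic)                             # uses all rows
approx = sampled_score(real, synthetic, sample_size=100, seed=0)  # estimate
print(full, approx)
```

The sampled score is an estimate of the full score: it is cheaper to compute, but it can drift further from the full value as the sample gets smaller.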