# NewRowSynthesis

This metric measures whether each row in the synthetic data is novel, or whether it exactly matches an original row in the real data.

## Data Compatibility

**Categorical**: This metric works with discrete, categorical columns

**Boolean**: This metric works with boolean columns

**Numerical**: This metric works with numerical columns

**Datetime**: This metric works with datetime columns

This metric also looks for matches in missing values. It ignores any other columns that may be present in your data.

## Score

**(best) 1.0**: The rows in the synthetic data are all new. There are no matches with the real data.

**(worst) 0.0**: All the rows in the synthetic data are copies of rows in the real data.

For example, if 40% of the rows in the synthetic data match a row in the real data, the score is 0.60.

## How does it work?

This metric looks for matching rows between the real and synthetic datasets. To be considered a match, every value in the synthetic row must match the corresponding value in the real row. The exact matching criterion depends on the type of data.

**Categorical/Boolean Data**: The value in the real data must be exactly the same as the value in the synthetic data.

**Numerical/Datetime Data**: This metric scales every value (*x*) in the real and synthetic data, using *r*, the set of all values in the real data, as the reference for the scaling.
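As a hedged illustration, one common way to scale values against the real data is min-max normalization. Assuming that form (the specific formula is an assumption, as is the primed notation), each value *x* would be scaled as:

$$
x' = \frac{x - \min(r)}{\max(r) - \min(r)}
$$

Under this assumption, real values land in the range [0, 1], while synthetic values outside the real range fall below 0 or above 1.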

Then, we consider a match when the synthetic value is within a given percentage of the real value. The percentage is a parameter, which is set to `0.01` (1%) by default.

**Missing Values**: To be considered a match, both real and synthetic values must be missing.

Finally, we compute the proportion of rows in the synthetic data that match a row in the real data. The score is the complement of that proportion, ensuring that 1 is the best score (every row is new) while 0 is the worst (every row has a match).
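The steps above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the helper names (`is_missing`, `values_match`, `new_row_score`) are hypothetical, rows are represented as dictionaries, and the numerical scaling step is skipped so that the tolerance is applied directly relative to the real value.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def values_match(real, synthetic, numerical, tolerance=0.01):
    """Compare one real value with one synthetic value."""
    if is_missing(real) or is_missing(synthetic):
        # Missing values only match other missing values.
        return is_missing(real) and is_missing(synthetic)
    if numerical:
        # Assumption: tolerance is relative to the real value, with no scaling.
        return abs(synthetic - real) <= abs(real) * tolerance
    # Categorical and boolean values must be identical.
    return real == synthetic


def new_row_score(real_rows, synthetic_rows, numerical_columns, tolerance=0.01):
    """Return 1 minus the share of synthetic rows that match some real row."""
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty")
    matched = 0
    for synthetic in synthetic_rows:
        for real in real_rows:
            if all(
                values_match(real[col], synthetic[col], col in numerical_columns, tolerance)
                for col in synthetic
            ):
                matched += 1
                break  # one matching real row is enough
    return 1.0 - matched / len(synthetic_rows)


real = [{"age": 30, "city": "NY"}, {"age": 40, "city": None}]
synthetic = [
    {"age": 30.2, "city": "NY"},  # within 1% of 30, same city: match
    {"age": 40, "city": None},    # exact copy, including the missing value: match
    {"age": 55, "city": "LA"},    # new
    {"age": 30, "city": "LA"},    # age matches, city does not: new
    {"age": 41, "city": None},    # 41 is more than 1% away from 40: new
]
score = new_row_score(real, synthetic, numerical_columns={"age"})
print(round(score, 2))  # 0.6
```

Two of the five synthetic rows match a real row, so the matching proportion is 0.4 and the score is its complement.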

## Usage

Access this metric from the `single_table` module and use the `compute` method.
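A minimal sketch of a call, assuming the metric is exposed as `NewRowSynthesis` in the `single_table` module of the `sdmetrics` package (the package name is an assumption here). The wrapper name `score_new_rows` is hypothetical, and the three arguments are expected to be prepared by the caller.

```python
def score_new_rows(real_table, synthetic_table, metadata):
    """Compute the NewRowSynthesis score for one real/synthetic table pair."""
    # Assumption: the package is named `sdmetrics`. The import is kept inside
    # the function so the sketch can be loaded without the package installed.
    from sdmetrics.single_table import NewRowSynthesis

    return NewRowSynthesis.compute(
        real_data=real_table,            # pandas.DataFrame with the real columns
        synthetic_data=synthetic_table,  # pandas.DataFrame with the synthetic columns
        metadata=metadata,               # metadata dictionary describing the columns
        numerical_match_tolerance=0.01,  # default: values within 1% count as a match
        synthetic_sample_size=None,      # default: use all of the synthetic rows
    )
```

The last two keyword arguments are written out with their default values only to make them visible; they can be omitted.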

**Parameters**

- (required) `real_data`: A pandas.DataFrame containing real columns
- (required) `synthetic_data`: A similar pandas.DataFrame containing synthetic columns
- (required) `metadata`: A metadata dictionary describing the columns (see Single Table Metadata)
- `numerical_match_tolerance`: A float >0.0 representing how close two numerical values have to be in order to be considered a match.
  - (default) `0.01`, which represents 1%
- `synthetic_sample_size`: The number of synthetic rows to sample before computing this metric. Use this to speed up the computation time if you have a large amount of synthetic data. Note that the final score may not be as precise if your sample size is low.
  - (default) `None`: Do not sample. Use all of the synthetic data in the computation.
