# TVComplement

This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes -- aka the *marginal distribution* or 1D histogram of the column.

## Data Compatibility

**Categorical**: This metric is meant for discrete, categorical data

**Boolean**: This metric works well on boolean data

This metric ignores any missing values.

## Score

**(best) 1.0**: The real data is exactly the same as the synthetic data

**(worst) 0.0**: The real and synthetic data are as different as they can be

The bar graph below compares real and synthetic data. Because of the differences between the categories, the TVComplement score is 0.68.

## How does it work?

This test computes the Total Variation Distance (TVD) between the real and synthetic columns. To do this, it first computes the frequency of each category value and expresses it as a probability. The TVD statistic compares the differences in probabilities, as shown in the formula below [1]:

$$
\delta(R, S) = \frac{1}{2} \sum_{\omega \in \Omega} \left| R_\omega - S_\omega \right|
$$

Here, ω ranges over all the possible categories in a column, Ω. Meanwhile, R and S refer to the real and synthetic frequencies for those categories. The TVComplement returns `1-TVD` so that a higher score means higher quality.
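The computation described above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the helper names (`_drop_missing`, `category_probabilities`, `tv_complement_sketch`) are hypothetical, and treating only `None` and float NaN as missing is an assumption made for the example.

```python
import math
from collections import Counter


def _drop_missing(values):
    """Drop None and float NaN entries (assumed here to be the 'missing' values)."""
    return [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]


def category_probabilities(values):
    """Map each category to its relative frequency among the non-missing values."""
    kept = _drop_missing(values)
    counts = Counter(kept)
    total = len(kept)
    return {category: count / total for category, count in counts.items()}


def tv_complement_sketch(real, synthetic):
    """Return 1 - TVD, where TVD is half the summed absolute probability gaps."""
    r = category_probabilities(real)
    s = category_probabilities(synthetic)
    # A category absent from one column has probability 0 there.
    categories = set(r) | set(s)
    tvd = 0.5 * sum(abs(r.get(c, 0.0) - s.get(c, 0.0)) for c in categories)
    return 1 - tvd


real = ["A", "A", "B", "C", None]   # the None entry is ignored
synthetic = ["A", "B", "B", "B"]
score = tv_complement_sketch(real, synthetic)
print(score)
```

In this example the real probabilities are A = 0.5, B = 0.25, C = 0.25 and the synthetic ones are A = 0.25, B = 0.75, C = 0, so the absolute differences sum to 1.0, the TVD is 0.5, and the complement is 0.5.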

## Usage

**Recommended Usage:** The Quality Report applies this metric to every compatible column and provides visualizations to understand the score.

To manually use the metric, access the `single_column` module and call the `compute` method.

**Parameters**

- (required) `real_data`: A pandas.Series containing a single column
- (required) `synthetic_data`: A similar pandas.Series object with the synthetic version of the column

## FAQs

### References

[1] https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures
