TVComplement
This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes -- aka the marginal distribution or 1D histogram of the column.
Data Compatibility
Categorical: This metric is meant for discrete, categorical data.
Boolean: This metric works well on boolean data.
This metric ignores any missing values.
Score
(best) 1.0: The real data is exactly the same as the synthetic data
(worst) 0.0: The real and synthetic data are as different as they can be
The bar graph below compares real and synthetic data. Because of the differences between the categories, the TVComplement score is 0.68.
How does it work?
This metric computes the Total Variation Distance (TVD) between the real and synthetic columns. To do this, it first computes the frequency of each category value and expresses it as a probability. The TVD statistic compares the differences in probabilities, as shown in the formula below [1]:
$$\delta(R, S) = \frac{1}{2} \sum_{\omega \in \Omega} \left| R_\omega - S_\omega \right|$$
Here, Ω is the set of all possible categories in a column and ω is a single category in that set. Meanwhile, R and S refer to the real and synthetic frequencies for those categories. The TVComplement returns 1 - TVD so that a higher score means higher quality.
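The computation described above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the library's actual implementation: the function name tv_complement and the sample data are hypothetical, and missing values are simply dropped to mirror the note that the metric ignores them.

```python
import pandas as pd


def tv_complement(real: pd.Series, synthetic: pd.Series) -> float:
    """Return 1 - TVD for two categorical columns (illustrative sketch)."""
    # Ignore missing values, then express category counts as probabilities.
    real_probs = real.dropna().value_counts(normalize=True)
    synth_probs = synthetic.dropna().value_counts(normalize=True)

    # Align on the union of categories; a category absent from one
    # column gets probability 0 there.
    real_probs, synth_probs = real_probs.align(synth_probs, fill_value=0.0)

    # TVD is half the sum of absolute probability differences.
    tvd = 0.5 * (real_probs - synth_probs).abs().sum()
    return float(1.0 - tvd)


# Hypothetical sample columns.
real = pd.Series(["a", "a", "b", "c", None])
synthetic = pd.Series(["a", "b", "b", "b"])
print(tv_complement(real, synthetic))
```

In this sample the real probabilities are a: 0.5, b: 0.25, c: 0.25 and the synthetic ones are a: 0.25, b: 0.75, c: 0, so the TVD is 0.5 and the sketch returns 0.5.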
Usage
Recommended Usage: The Quality Report applies this metric to every compatible column and provides visualizations to understand the score.
To manually use the metric, access the single_column module and call the compute method.
Parameters
(required) real_data: A pandas.Series containing a single column
(required) synthetic_data: A similar pandas.Series object with the synthetic version of the column
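A manual call might look like the sketch below. It assumes the metric is importable as TVComplement from the single_column module of the sdmetrics package (the package name is an assumption, since this page does not state it), and the two sample columns are hypothetical. The import is guarded so the snippet still runs where the package is not installed.

```python
import pandas as pd

# Hypothetical boolean-like columns; the real one contains a missing value.
real_column = pd.Series(["yes", "no", "yes", "yes", None])
synthetic_column = pd.Series(["yes", "no", "no", "yes", "no"])

score = None
try:
    # Assumption: the metric lives in the sdmetrics package.
    from sdmetrics.single_column import TVComplement
except ImportError:
    print("sdmetrics is not installed; skipping the metric call")
else:
    # Both parameters are required and take a single-column pandas.Series.
    score = TVComplement.compute(
        real_data=real_column,
        synthetic_data=synthetic_column,
    )
    print(score)
```

The returned score should fall between 0.0 (worst) and 1.0 (best).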
References
[1] https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures