This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes -- aka the marginal distribution or 1D histogram of the column.
- Categorical: This metric is meant for discrete, categorical data
- Boolean: This metric works well on boolean data
This metric ignores any missing values.
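As a rough illustration of what ignoring missing values means for the category frequencies, the sketch below uses plain pandas. It is not the SDMetrics internals, which are only assumed to behave similarly, and the column values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries (one NaN, one None)
column = pd.Series(["a", "b", np.nan, "a", None, "b", "a", "b"])

# value_counts drops missing values by default, so the relative
# frequencies are computed over the 6 observed entries only
frequencies = column.value_counts(normalize=True)
```

Both categories end up with a frequency of 0.5, because the two missing entries do not count toward the total.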
(best) 1.0: The real data is exactly the same as the synthetic data
(worst) 0.0: The real and synthetic data are as different as they can be
The bar graph below compares real and synthetic data. Because of the differences between the categories, the TVComplement score is 0.68.
This bar graph shows the frequencies of each category value for real vs. synthetic data. The difference is shown in red.
This test computes the Total Variation Distance (TVD) between the real and synthetic columns. To do this, it first computes the frequency of each category value and expresses it as a probability. The TVD statistic compares the differences in probabilities, as shown in the formula below:

$$\delta(R, S) = \frac{1}{2} \sum_{\omega \in \Omega} \left| R_\omega - S_\omega \right|$$
Here, ω ranges over all the possible categories, Ω, in a column. Meanwhile, R and S refer to the real and synthetic frequencies for those categories. The TVComplement returns 1 - TVD so that a higher score means higher quality.
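The computation described above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `tv_complement` and the sample columns are hypothetical.

```python
import pandas as pd


def tv_complement(real: pd.Series, synthetic: pd.Series) -> float:
    """Return 1 - TVD between the category frequencies of two columns."""
    # Relative frequency of each category; missing values are dropped
    real_freq = real.value_counts(normalize=True)
    synth_freq = synthetic.value_counts(normalize=True)

    # Align on the union of categories, treating absent ones as probability 0
    diff = real_freq.sub(synth_freq, fill_value=0.0)

    # TVD is half the sum of absolute differences in probability
    tvd = 0.5 * diff.abs().sum()
    return 1.0 - tvd


# Hypothetical columns: real is 50/30/20 percent, synthetic is 30/30/40 percent
real = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 2)
synthetic = pd.Series(["a"] * 3 + ["b"] * 3 + ["c"] * 4)
score = tv_complement(real, synthetic)
```

In this example the probabilities differ by 0.2 for `a` and for `c`, so the TVD is 0.2 and the score is approximately 0.8. Two identical columns give 1.0, and two columns with no category in common give 0.0.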
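To manually use the metric, access the single_column module and call the compute method.
from sdmetrics.single_column import TVComplement
TVComplement.compute(real_data=real_column, synthetic_data=synthetic_column)  # real_column and synthetic_column are placeholder pandas.Series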
real_data: A pandas.Series containing a single column
synthetic_data: A similar pandas.Series object with the synthetic version of the column
This metric compares the categorical frequencies of real and synthetic data. A perfect score of 1.0 means that the frequencies are exactly the same, which also implies that all categories are present.
However, a score lower than 1.0 does not tell you anything about category coverage on its own. It may be due to the frequencies being different or it may be due to missing categories. Use the CategoryCoverage metric to get more insight.
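To make the ambiguity concrete, the sketch below builds two hypothetical synthetic columns that get the same score for different reasons. The `tv_complement` and `coverage` helpers are simplified stand-ins written for this example; in particular, `coverage` only approximates what CategoryCoverage reports and is not the library's code.

```python
import pandas as pd


def tv_complement(real, synthetic):
    # 1 - TVD over relative category frequencies (sketch, not the library code)
    diff = real.value_counts(normalize=True).sub(
        synthetic.value_counts(normalize=True), fill_value=0.0
    )
    return 1.0 - 0.5 * diff.abs().sum()


def coverage(real, synthetic):
    # Fraction of real categories that also appear in the synthetic column
    real_cats = set(real.dropna().unique())
    synth_cats = set(synthetic.dropna().unique())
    return len(real_cats & synth_cats) / len(real_cats)


real = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 2)

# Case 1: all categories present, but the frequencies are shifted
shifted = pd.Series(["a"] * 7 + ["b"] * 1 + ["c"] * 2)
# Case 2: category "c" is missing entirely
missing = pd.Series(["a"] * 7 + ["b"] * 3)

scores = (tv_complement(real, shifted), tv_complement(real, missing))
coverages = (coverage(real, shifted), coverage(real, missing))
```

Both cases score approximately 0.8, yet the first covers every real category while the second covers only two of the three. The score alone cannot separate them, which is why a coverage check is a useful companion.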