TVComplement

This metric computes the similarity between a real column and a synthetic column in terms of column shape, that is, the marginal distribution or 1D histogram of the column.

Data Compatibility

  • Categorical: This metric is meant for discrete, categorical data

  • Boolean: This metric works well on boolean data

This metric ignores any missing values.

Score

(best) 1.0: The real data is exactly the same as the synthetic data

(worst) 0.0: The real and synthetic data are as different as they can be

The bar graph below compares real and synthetic data. Because of the differences between the categories, the TVComplement score is 0.68.

How does it work?

This metric computes the Total Variation Distance (TVD) between the real and synthetic columns. To do this, it first computes the frequency of each category value and expresses it as a probability. The TVD statistic sums the absolute differences between those probabilities and halves the result, as shown in the formula below [1]:

$$\delta(R, S) = \frac{1}{2}\sum_{\omega \in \Omega} | R_\omega - S_\omega |$$

Here, ω ranges over Ω, the set of all possible categories in the column. Meanwhile, R_ω and S_ω refer to the real and synthetic frequencies for category ω. The TVComplement returns 1-TVD so that a higher score means higher quality.

$$score = 1 - \delta(R, S)$$
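The formula above can be sketched by hand with pandas. This is a minimal illustration of the math, not the library's implementation. The function name tv_complement_sketch is hypothetical, and the handling of missing values and unseen categories is an assumption based on the description on this page.

```python
import pandas as pd


def tv_complement_sketch(real: pd.Series, synthetic: pd.Series) -> float:
    """Illustrative 1 - TVD between two categorical columns (hypothetical helper)."""
    # The metric ignores missing values, so drop them first.
    real = real.dropna()
    synthetic = synthetic.dropna()

    # Express the frequency of each category as a probability.
    r = real.value_counts(normalize=True)
    s = synthetic.value_counts(normalize=True)

    # Align both columns on the union of categories.
    # Assumption: a category absent from one column has probability 0 there.
    categories = r.index.union(s.index)
    r = r.reindex(categories, fill_value=0.0)
    s = s.reindex(categories, fill_value=0.0)

    # TVD is half the sum of absolute differences in probability.
    tvd = 0.5 * (r - s).abs().sum()
    return 1.0 - tvd


real_column = pd.Series(["a", "a", "b", "b"])
synthetic_column = pd.Series(["a", "a", "a", "b"])
print(tv_complement_sketch(real_column, synthetic_column))
```

In this example the real probabilities are 0.5 and 0.5, the synthetic ones are 0.75 and 0.25, so the TVD is 0.5 * (0.25 + 0.25) = 0.25 and the score is 0.75.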

Usage

Recommended Usage: The Quality Report applies this metric to every compatible column and provides visualizations to understand the score.

To manually use the metric, access the single_column module and call the compute method.

from sdmetrics.single_column import TVComplement

TVComplement.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

  • (required) real_data: A pandas.Series containing a single column

  • (required) synthetic_data: A similar pandas.Series object with the synthetic version of the column

FAQs

Is there a similar metric for continuous numerical columns?

Use the KSComplement as the counterpart to this metric for numerical and datetime columns.

Does the score indicate if all the categories are present?

This metric compares the categorical frequencies of real and synthetic data. A perfect score of 1.0 means that the frequencies are exactly the same, which implies that all categories are present.

However, a score lower than 1.0 does not by itself tell you whether any categories are missing. The lower score may be due to categories appearing with different frequencies, to categories missing from the synthetic data, or both. Use the CategoryCoverage metric to get more insight.
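To illustrate why the score alone cannot reveal missing categories, the sketch below scores two synthetic columns against the same real column. It uses a hand-written helper (the name score_sketch is hypothetical, not part of the library) and a plain set difference as a stand-in for a coverage check. It does not call the CategoryCoverage metric itself.

```python
import pandas as pd


def score_sketch(real: pd.Series, synthetic: pd.Series) -> float:
    """Illustrative 1 - TVD, written by hand (hypothetical helper)."""
    r = real.dropna().value_counts(normalize=True)
    s = synthetic.dropna().value_counts(normalize=True)
    # Assumption: categories absent from one column count as probability 0.
    categories = r.index.union(s.index)
    r = r.reindex(categories, fill_value=0.0)
    s = s.reindex(categories, fill_value=0.0)
    return 1.0 - 0.5 * (r - s).abs().sum()


real = pd.Series(["a", "a", "b", "c"])

# All categories present, but with different proportions.
synthetic_all_present = pd.Series(["a", "b", "b", "c"])
# Category "c" never appears in the synthetic column.
synthetic_missing_c = pd.Series(["a", "a", "a", "b"])

score_all_present = score_sketch(real, synthetic_all_present)
score_missing_c = score_sketch(real, synthetic_missing_c)

# Simple stand-in for a coverage check: real categories absent from synthetic.
missing_all_present = set(real) - set(synthetic_all_present)
missing_c = set(real) - set(synthetic_missing_c)

print(score_all_present, missing_all_present)
print(score_missing_c, missing_c)
```

Both synthetic columns receive a score of 0.75, yet only the second one is missing a category. Telling the two cases apart requires a separate coverage check.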

References

[1] https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures
