
TVComplement


Last updated 1 year ago

This metric computes the similarity of a real column vs. a synthetic column in terms of the column's shape, i.e. the marginal distribution (1D histogram) of the column.

Data Compatibility

  • Categorical: This metric is meant for discrete, categorical data

  • Boolean: This metric works well on boolean data

This metric ignores any missing values.

Score

(best) 1.0: The real data is exactly the same as the synthetic data

(worst) 0.0: The real and synthetic data are as different as they can be

[Figure: bar graph showing the frequencies of each category value for real vs. synthetic data, with the differences shown in red.] Because of the differences between the categories, the TVComplement score in this example is 0.68.

How does it work?

This metric computes the Total Variation Distance (TVD) between the real and synthetic columns. To do this, it first computes the frequency of each category value and expresses it as a probability. The TVD statistic sums the absolute differences between these probabilities, as shown in the formula below [1]:

\delta(R, S) = \frac{1}{2}\sum_{\omega \in \Omega} | R_\omega - S_\omega |

Here, ω ranges over the possible categories in the column, and Ω is the set of all such categories. Meanwhile, R_ω and S_ω refer to the real and synthetic frequencies for category ω. The TVComplement returns 1-TVD so that a higher score means higher quality.

score = 1 - \delta(R, S)
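As a sketch of the computation described above (illustrative only; the SDMetrics internals may differ), the TVD and its complement can be computed from raw category lists:

```python
from collections import Counter

def tv_complement(real, synthetic):
    """Return 1 - Total Variation Distance between two categorical samples."""
    real_counts = Counter(real)
    synth_counts = Counter(synthetic)
    n_real, n_synth = len(real), len(synthetic)
    # Union of categories observed in either column (this is Ω)
    categories = set(real_counts) | set(synth_counts)
    # TVD: half the sum of absolute differences between the probabilities
    tvd = 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )
    return 1 - tvd

# Real data is split 50/50 between 'A' and 'B'; synthetic is 70/30,
# so each category's probability is off by 0.2 and TVD = 0.2.
real = ['A'] * 50 + ['B'] * 50
synthetic = ['A'] * 70 + ['B'] * 30
print(tv_complement(real, synthetic))  # 0.8
```

Identical distributions yield a TVD of 0 and therefore a perfect score of 1.0.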

Usage

To manually use the metric, access the single_column module and call the compute method.

from sdmetrics.single_column import TVComplement

TVComplement.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

  • (required) real_data: A pandas.Series containing a single column

  • (required) synthetic_data: A similar pandas.Series object with the synthetic version of the column
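As noted under Data Compatibility, the metric ignores missing values. A minimal sketch of how category frequencies might be computed with missing values excluded (illustrative only, not the SDMetrics implementation):

```python
import pandas as pd

real = pd.Series(['A', 'B', 'A', None, 'B', 'A'])

# value_counts with normalize=True converts counts to probabilities;
# missing values are dropped before normalizing, so the probabilities
# are computed over the 5 non-null entries only.
freq = real.value_counts(normalize=True, dropna=True)
print(freq['A'], freq['B'])  # 0.6 0.4
```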

Recommended Usage: The Quality Report applies this metric to every compatible column and provides visualizations to understand the score.

FAQs

Is there a similar metric for continuous numerical columns?

Use the KSComplement as the counterpart to this metric for numerical and datetime columns.

Does the score indicate if all the categories are present?

This metric compares the categorical frequencies of real and synthetic data. A perfect score of 1.0 means that the data is exactly the same, meaning all categories are present. However, if you receive a score lower than 1.0, you cannot draw any conclusions. It may be due to the data being different or it may be due to missing categories. Use the CategoryCoverage metric to get more insight.

References

[1] https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures