
CSTest

This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes. You can think of the shape as what you observe when you plot a bar graph of the column.

Data Compatibility

  • Categorical: This metric is meant for discrete, categorical data

  • Boolean: This metric works on boolean data

This metric does not accept missing values.

Score

(best) 1.0: The p-value is high, indicating that the synthetic data is not very different from the real data

(worst) 0.0: The p-value is low, indicating that the synthetic data is significantly different from the real data

How does it work?

This test normalizes the real and synthetic data in order to compute the category frequencies. Then, it applies the Chi-squared test [1] to test the null hypothesis that the synthetic data comes from the same distribution as the real data.

The test returns the p-value [2]. A smaller p-value indicates that the synthetic data is significantly different from the real data, which rejects the null hypothesis and leads to a worse overall score.
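The steps above can be sketched roughly as follows. This is an illustration only, not the SDMetrics source: the helper name chi_squared_p_value and the sample columns are hypothetical, and the sketch assumes that every category in the synthetic column also appears in the real column.

```python
import pandas as pd
from scipy.stats import chisquare


def chi_squared_p_value(real_column, synthetic_column):
    """Hypothetical helper: Chi-squared p-value on normalized category frequencies."""
    # Normalize the counts so that each column's frequencies sum to 1.
    real_freq = real_column.value_counts(normalize=True)
    synthetic_freq = synthetic_column.value_counts(normalize=True)

    # Align the synthetic frequencies to the real categories.
    # Assumption: every synthetic category also appears in the real column.
    synthetic_freq = synthetic_freq.reindex(real_freq.index, fill_value=0.0)

    # Null hypothesis: the synthetic column follows the real distribution.
    _, p_value = chisquare(
        f_obs=synthetic_freq.to_numpy(),
        f_exp=real_freq.to_numpy(),
    )
    return float(p_value)


# Made-up example columns.
real_column = pd.Series(['a'] * 50 + ['b'] * 50)
matching_column = pd.Series(['a'] * 25 + ['b'] * 25)
skewed_column = pd.Series(['a'] * 90 + ['b'] * 10)

p_matching = chi_squared_p_value(real_column, matching_column)
p_skewed = chi_squared_p_value(real_column, skewed_column)
print(p_matching, p_skewed)
```

In this sketch, the column whose category proportions match the real column gets the highest possible p-value, while the skewed column gets a lower one.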

Usage

Access this metric from the single_column module and use the compute method.

from sdmetrics.single_column import CSTest

CSTest.compute(
    real_data=real_column,
    synthetic_data=synthetic_column
)

Parameters

  • (required) real_data: A pandas.Series containing a single column

  • (required) synthetic_data: A similar pandas.Series object with the synthetic version of the column

FAQs

This metric is in Beta. Be careful when using the metric and interpreting its score.

  • This test is invalid when the category frequencies are too small [3]. Because this test normalizes all frequencies, none of them can exceed 1, which affects the overall results.

  • The p-value may be hacked by supplying data of different sizes.

  • The p-value interpretation may not be useful. Most users are interested in quantifying the differences between real and synthetic data, not testing the null hypothesis. In fact, if the synthetic data model is a simplification of the real data, the null hypothesis will nearly always be false with enough data.
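To get a feel for why normalized frequencies matter, the sketch below runs scipy.stats.chisquare twice on the same made-up category counts: once on the raw counts and once on the normalized frequencies. It calls SciPy directly rather than CSTest, and the counts are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import chisquare

# Made-up category counts for a two-category column.
real_counts = np.array([500, 500])
synthetic_counts = np.array([900, 100])

# Chi-squared test on the raw counts.
_, p_counts = chisquare(f_obs=synthetic_counts, f_exp=real_counts)

# The same test after normalizing each set of counts to sum to 1.
_, p_normalized = chisquare(
    f_obs=synthetic_counts / synthetic_counts.sum(),
    f_exp=real_counts / real_counts.sum(),
)

print(p_counts, p_normalized)
```

With raw counts the p-value is close to 0, while with normalized frequencies the same 90/10 versus 50/50 split yields a much larger p-value, because the test statistic no longer reflects how many rows were observed.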

Consider switching to the TVComplement metric to quantify the differences between discrete real and synthetic columns.

References

[1] https://en.wikipedia.org/wiki/Chi-squared_test

[2] https://en.wikipedia.org/wiki/P-value

[3] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html