
KSComplement

Last updated 2 years ago

This metric computes the similarity between a real column and a synthetic column in terms of column shape, that is, the marginal distribution (1D histogram) of the column.

Data Compatibility

  • Numerical: This metric is meant for continuous, numerical data.

  • Datetime: This metric converts datetime values into numerical values.

This metric ignores missing values.
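The two preparation steps above (dropping missing values and turning datetimes into numbers) can be sketched with pandas. This is a minimal illustration, not the SDMetrics implementation: the helper name to_numeric_without_missing is hypothetical, and the choice of seconds since 1970-01-01 as the numeric unit is an assumption. The unit does not affect the result, because the KS statistic is unchanged by any strictly increasing rescaling applied to both columns.

```python
import numpy as np
import pandas as pd


def to_numeric_without_missing(column):
    """Drop missing values; convert datetimes to float seconds since the epoch.

    Hypothetical helper for illustration only (assumes timezone-naive datetimes).
    """
    column = column.dropna()
    if pd.api.types.is_datetime64_any_dtype(column):
        # Assumed unit: seconds since 1970-01-01.
        column = (column - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)
    return column.to_numpy(dtype=float)


# A datetime column with one missing value (NaT)
dates = pd.Series(pd.to_datetime(['2024-01-01', None, '2024-01-02']))
date_values = to_numeric_without_missing(dates)

# A numerical column with one missing value (NaN)
numbers = pd.Series([1.5, np.nan, 3.0])
number_values = to_numeric_without_missing(numbers)
```

Here date_values holds two numbers that are 86400 seconds (one day) apart, and number_values holds only 1.5 and 3.0.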

Score

(best) 1.0: The real data is exactly the same as the synthetic data

(worst) 0.0: The real and synthetic data are as different as they can be

The graphs below show two examples with real and synthetic data (black and green). At the left, the synthetic data is similar to the real data so the score is close to 1. At the right, the shapes are different so the score is lower.

How does it work?

The KSComplement uses the Kolmogorov-Smirnov statistic [1]. To compute this statistic, we convert a numerical distribution into its cumulative distribution function (CDF) [2]. The KS statistic is the maximum difference between the two CDFs, as shown below.

The KS statistic is a value between 0 and 1. In SDMetrics, we invert the statistic: the KSComplement returns 1 - (KS statistic), so that a higher score means higher quality.
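The computation described above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the SDMetrics implementation: the function name ks_complement is hypothetical, and both inputs are assumed to be numeric with missing values already removed. SciPy's scipy.stats.ks_2samp computes the same two-sample statistic and could be used in place of the manual steps.

```python
import numpy as np


def ks_complement(real, synthetic):
    """Return 1 - (two-sample KS statistic) for two 1D numeric samples."""
    real = np.sort(np.asarray(real, dtype=float))
    synthetic = np.sort(np.asarray(synthetic, dtype=float))

    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value seen in either sample.
    grid = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, grid, side='right') / len(real)
    cdf_synthetic = np.searchsorted(synthetic, grid, side='right') / len(synthetic)

    # KS statistic: the maximum absolute difference between the two CDFs.
    ks_statistic = np.max(np.abs(cdf_real - cdf_synthetic))
    return 1.0 - ks_statistic


shifted = ks_complement([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])  # 0.75
identical = ks_complement([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])          # 1.0 (best)
disjoint = ks_complement([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])        # 0.0 (worst)
```

The last two calls reproduce the endpoints of the score range: identical samples give 1.0, and samples with no overlap give 0.0.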

Usage

To manually run this metric, access the single_column module and use the compute method.

from sdmetrics.single_column import KSComplement

KSComplement.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

  • (required) real_data: A pandas.Series containing a single column

  • (required) synthetic_data: A similar pandas.Series object with the synthetic version of the column

Recommended Usage: The Quality Report applies this metric to every compatible column and provides visualizations to understand the score.

FAQs

Is there a similar metric for discrete categorical columns?

Use the TVComplement metric as the counterpart to this test for categorical and boolean columns.

References

[1] Kolmogorov-Smirnov statistic

[2] Cumulative Distribution Function (CDF)