MissingValueSimilarity

This metric checks whether the synthetic data has the same proportion of missing values as the real data for a given column.

Data Compatibility

  • All data: Any data is compatible with this metric as long as it contains missing values

Score

(best) 1.0: The synthetic data perfectly captures the proportion of missing values

(worst) 0.0: The synthetic data has a completely different proportion of missing values than the real data

How does it work?

This test computes the proportion of missing values, p, in both the real data, R, and the synthetic data, S. It then subtracts the absolute difference between the two proportions from 1, which gives a similarity score in the range [0, 1], with 1 representing the highest similarity.

$$score = 1 - |S_p - R_p|$$

Note that the subtracted term, |S_p - R_p|, is equivalent to the Total Variation Distance [1] between the real and synthetic data when each value is classified as either missing or non-missing.
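The calculation described above can be sketched in a few lines of pandas. This is only an illustration of the formula, not the library's implementation: the helper names `missing_value_similarity` and `missing_tvd` are hypothetical, and the toy columns are made up for the example.

```python
import pandas as pd


def missing_value_similarity(real: pd.Series, synthetic: pd.Series) -> float:
    """Hypothetical helper (not part of SDMetrics): 1 - |S_p - R_p|."""
    r_p = real.isna().mean()       # proportion of missing values in the real column
    s_p = synthetic.isna().mean()  # proportion of missing values in the synthetic column
    return float(1.0 - abs(s_p - r_p))


def missing_tvd(real: pd.Series, synthetic: pd.Series) -> float:
    """Hypothetical helper: Total Variation Distance over {missing, non-missing}."""
    r_p = real.isna().mean()
    s_p = synthetic.isna().mean()
    # TVD is half the sum of absolute differences across both categories.
    return float(0.5 * (abs(s_p - r_p) + abs((1 - s_p) - (1 - r_p))))


# Toy data: 2 of 8 real values are missing, 4 of 8 synthetic values are missing.
real = pd.Series([1.0, None, 3.0, 4.0, None, 6.0, 7.0, 8.0])
synthetic = pd.Series([1.0, None, None, 4.0, None, None, 7.0, 8.0])

score = missing_value_similarity(real, synthetic)  # 1 - |0.5 - 0.25|
tvd = missing_tvd(real, synthetic)                 # |0.5 - 0.25|
```

With only two categories (missing and non-missing), the two absolute differences in the TVD are equal, so the TVD reduces to |S_p - R_p| and the score is 1 minus that distance.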

Usage

Access this metric from the single_column module and use the compute method.

from sdmetrics.single_column import MissingValueSimilarity

MissingValueSimilarity.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

  • (required) real_data: A pandas.Series containing a single column with missing values

  • (required) synthetic_data: A pandas.Series object with the synthetic version of the column

FAQs

What kind of values count as missing?

We use the same convention as pandas for determining when a value is missing [2]. Missing values in your data should be represented as NaN objects.

If you are using any special notation to denote missing values, convert them to NaN values before using this metric.
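As a minimal sketch of that conversion, assuming your data marks missing entries with the placeholder strings "N/A" and "?" (both hypothetical, as is the toy column), you could mask them out with pandas before computing the metric:

```python
import pandas as pd

# Hypothetical placeholder strings that stand in for "missing" in a raw export.
SENTINELS = ["N/A", "?"]

raw_column = pd.Series(["red", "N/A", "blue", "?", "red"])

# Before conversion, pandas treats every entry as a real value.
missing_before = int(raw_column.isna().sum())

# mask() replaces entries where the condition is True with NaN.
cleaned_column = raw_column.mask(raw_column.isin(SENTINELS))
missing_after = int(cleaned_column.isna().sum())
```

The same conversion should be applied to both the real and the synthetic column so that their missing-value proportions are measured consistently.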

References

[1] https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures

[2] https://pandas.pydata.org/docs/user_guide/missing_data.html

Last updated 2 years ago