StatisticMSAS

Last updated 5 months ago

This metric is for sequential data. It measures the similarity between a set of real sequences and a set of synthetic sequences by computing a summary statistic for each sequence and comparing the resulting distributions. Supported summary statistics are: mean, median, standard deviation, min, and max.

Data Compatibility

  • ID: This metric requires a column of ID values that distinguish between different sequences

  • Numerical: This metric computes statistics for a continuous, numerical data column

Both columns need to be present for this metric. This metric ignores missing values.

Score

(best) 1.0: The statistic for the real data is exactly the same as the synthetic data

(worst) 0.0: The statistic for the real data is extremely different from the synthetic data

How does it work?

This metric assumes you have an ID column to represent sequences. For example, if you are storing patient health information, the Patient ID column represents the sequence ID. You can then compare a statistic across any numerical column, such as Systolic BP.

To compute a score, this metric implements the Multi Sequence Aggregate Similarity approach from [1].

  1. It breaks up the real column of numerical values based on the sequence ID and computes a statistic value for each sequence. This yields a distribution of statistics, D_r.

  2. It repeats the process for the synthetic column of numerical values, yielding a separate distribution, D_s.

The metric then compares the two distributions using the KSComplement metric.

score = KSComplement(D_r, D_s)
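The steps above can be sketched in plain Python. This is only an illustration, not the SDMetrics implementation: the helper names `per_sequence_statistic` and `ks_complement` are hypothetical, the sample values are made up, and the sketch assumes that KSComplement is one minus the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs).

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def per_sequence_statistic(ids, values, statistic=mean):
    """Group values by sequence ID and reduce each sequence to one number."""
    groups = defaultdict(list)
    for seq_id, value in zip(ids, values):
        # Skip missing values (None or NaN), since the metric ignores them.
        if value is None or value != value:
            continue
        groups[seq_id].append(value)
    return [statistic(group) for group in groups.values()]


def ks_complement(sample_a, sample_b):
    """Return 1 minus the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The largest gap between two step functions occurs at a data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return 1.0 - max_gap


# Made-up example data: three real and three synthetic sequences.
real_ids = ['A', 'A', 'B', 'B', 'C', 'C']
real_bp = [120, 130, 110, 114, 140, 150]
synthetic_ids = ['X', 'X', 'Y', 'Y', 'Z', 'Z']
synthetic_bp = [121, 129, 111, 113, 141, 149]

d_r = per_sequence_statistic(real_ids, real_bp)            # D_r
d_s = per_sequence_statistic(synthetic_ids, synthetic_bp)  # D_s
score = ks_complement(d_r, d_s)
```

Here the per-sequence means are identical for the real and synthetic data, so the two distributions coincide and the sketch yields the best score. Note that the individual values and the ID labels differ; only the distribution of per-sequence statistics is compared.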

Usage

Access this metric from the column_pairs module and use the compute method.

from sdmetrics.column_pairs import StatisticMSAS

StatisticMSAS.compute(
    real_data=(real_table['Patient ID'], real_table['Systolic BP']),
    synthetic_data=(synthetic_table['Patient ID'], synthetic_table['Systolic BP']),
    statistic='mean'
)

Parameters

  • (required) real_data: A tuple of two pandas.Series objects containing the real data. The first Series holds the sequence IDs, and the second holds the numerical column.

  • (required) synthetic_data: A tuple of two pandas.Series objects containing the synthetic data. The first Series holds the sequence IDs, and the second holds the numerical column.

  • statistic: A string naming the statistical function to compute for each sequence

    • (default) 'mean': The arithmetic mean

    • 'median': The median value

    • 'std': The standard deviation

    • 'min': The min value

    • 'max': The max value
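To make the options concrete, the sketch below applies each statistic to a single made-up sequence using only the Python standard library. The `STATISTICS` lookup table is hypothetical and is not part of SDMetrics; in particular, using the sample standard deviation (`statistics.stdev`) for 'std' is an assumption about how the library computes it.

```python
from statistics import mean, median, stdev

# Hypothetical lookup table keyed by the statistic names.
STATISTICS = {
    'mean': mean,
    'median': median,
    'std': stdev,  # assumption: sample standard deviation (n - 1 divisor)
    'min': min,
    'max': max,
}

# One made-up sequence of Systolic BP readings for a single patient.
sequence = [118, 122, 120, 130, 125]

# Reduce the sequence to one number per statistic.
summary = {name: func(sequence) for name, func in STATISTICS.items()}
```

Each choice reduces the same sequence to a different single number, so the distributions D_r and D_s, and therefore the score, depend on which statistic you pick.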

FAQs

Do the ID values have to match up between the real and synthetic data?

No, the ID values are not expected to be the same between the real and synthetic data because they represent entirely different entities. The IDs are only used to group rows into sequences; the metric compares the overall distribution of per-sequence statistics, not individual sequences.

Is it better to use the mean or the median statistic?

The mean and median summarize the values differently, especially when your data has a skew [2].

  • The mean takes all values into account. It may be significantly affected by just a few large or small values. You can use the mean for a fast computation if you know there is no skew in your data or if you are ok with outliers affecting the score.

  • The median finds a middle value where 50% of the data is larger and 50% is smaller. The median is resilient to outliers in either direction. This may be desirable if you have skewed data, as the score is more representative of the typical values. Note that this computation takes longer.
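A small made-up example illustrates the difference. It compares two sequences that differ only in one extreme reading and measures how far each statistic moves; the values and variable names are purely illustrative.

```python
from statistics import mean, median

# Two made-up sequences: identical except for one extreme reading.
typical = [118, 120, 121, 122, 124]
with_outlier = [118, 120, 121, 122, 400]

# How far each statistic moves when the outlier is introduced.
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)
```

In this example the single outlier pulls the mean up by roughly 55 units, while the median does not move at all.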

References

[1] Sequential Models in the Synthetic Data Vault

[2] https://en.wikipedia.org/wiki/Skewness