
SequenceLengthSimilarity


Last updated 5 months ago

This metric is for sequential data. It measures the similarity between a real and synthetic column in terms of the length of sequences that they represent.

Data Compatibility

  • ID : This metric is meant for a column that represents sequence IDs. The IDs are used to distinguish between different sequences.

This metric ignores missing values.

Score

(best) 1.0: The sequence lengths in the synthetic data are exactly the same as the real data

(worst) 0.0: The sequence lengths in the synthetic data are as different as can be from the real data

How does it work?

This metric assumes you have an ID column to represent sequences. For example, if you are storing different sequences of patient health information, the Patient ID column represents the sequence ID. The length of a sequence is determined by how often an ID value repeats.

This metric first computes the length of each sequence in the real data. Because the data may contain multiple sequences, these lengths form a distribution, D_r. The metric then computes the same for the synthetic data, forming a second distribution, D_s.

$$\text{score} = \text{KSComplement}(D_r, D_s)$$
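
The computation described above can be sketched with pandas and SciPy. This is a minimal illustration, not the library's implementation: it assumes that KSComplement is one minus the two-sample Kolmogorov-Smirnov statistic, and the helper names `sequence_lengths` and `sequence_length_similarity` are hypothetical.

```python
import pandas as pd
from scipy.stats import ks_2samp


def sequence_lengths(id_column: pd.Series) -> pd.Series:
    """Return the length of each sequence, i.e. how often each ID repeats."""
    # value_counts() drops missing values by default
    return id_column.value_counts()


def sequence_length_similarity(real_ids: pd.Series, synthetic_ids: pd.Series) -> float:
    """Hypothetical helper: compare the two sequence-length distributions."""
    d_r = sequence_lengths(real_ids)       # distribution of real sequence lengths
    d_s = sequence_lengths(synthetic_ids)  # distribution of synthetic sequence lengths
    # Assumption: KSComplement = 1 - (two-sample KS statistic)
    return 1.0 - ks_2samp(d_r, d_s).statistic


# Toy data: both tables contain sequences of lengths 3, 2 and 1
real_ids = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])
synthetic_ids = pd.Series(['x', 'x', 'x', 'y', 'y', 'z'])
same_score = sequence_length_similarity(real_ids, synthetic_ids)

# Toy data: three real sequences of length 1 versus one synthetic sequence of length 5
different_score = sequence_length_similarity(
    pd.Series(['a', 'b', 'c']),
    pd.Series(['x'] * 5),
)
```

In this sketch, identical length distributions give the best score and completely non-overlapping ones give the worst, matching the score range described earlier.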

Usage

Access this metric from the single_column module and use the compute method.

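from sdmetrics.single_column import SequenceLengthSimilarity

# real_table and synthetic_table are pandas.DataFrame objects that contain the sequence ID column
SequenceLengthSimilarity.compute(
    real_data=real_table['id_column'],
    synthetic_data=synthetic_table['id_column']
)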

Parameters

  • (required) real_data: A pandas.Series object with the column of real data

  • (required) synthetic_data: A pandas.Series object with the column of synthetic data

FAQs

Do the ID values have to match up between the real and synthetic data?

No, the ID values are not expected to be the same between the real and synthetic data because they represent entirely different entities. This metric is computing the lengths of the sequences.

The metric then compares the two sequence-length distributions using the KSComplement metric.
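
To illustrate why the ID values do not need to match, the short sketch below relabels every ID and checks that the sequence lengths are unchanged. The ID values and the `relabel` mapping are made up for illustration.

```python
import pandas as pd

# Real sequence IDs: P1 has 3 rows, P2 has 2 rows, P3 has 1 row
real_ids = pd.Series(['P1', 'P1', 'P1', 'P2', 'P2', 'P3'])

# Hypothetical synthetic IDs: same sequence structure, entirely different labels
relabel = {'P1': 'S7', 'P2': 'S8', 'P3': 'S9'}
synthetic_ids = real_ids.map(relabel)

# Only the counts per ID matter, not the ID values themselves
real_lengths = sorted(real_ids.value_counts().tolist())
synthetic_lengths = sorted(synthetic_ids.value_counts().tolist())
print(real_lengths, synthetic_lengths)
```

Both lists contain the same lengths even though the two columns share no ID values.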