
CategoryAdherence



This metric measures whether a synthetic column adheres to the same category values as the real data. (The synthetic data should not be inventing new category values that are not originally present in the real data.)

Data Compatibility

  • Categorical: This metric is meant for discrete, categorical data

  • Boolean: This metric is meant for boolean data

If you have missing values in the real data, then the metric will consider them valid in the synthetic data. Otherwise, they will be marked as an invalid category. All types of missing values (NaN, None, etc.) will be counted as the same category of 'missing'.
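As a quick illustration of this missing-value handling, the snippet below collapses all missing values into a single category before checking adherence. This is a minimal sketch in pandas, not the library's internal implementation; the `"__missing__"` sentinel label is an illustrative assumption.

```python
import pandas as pd

real = pd.Series(["a", "b", None, "a"])
synthetic = pd.Series(["a", None, float("nan"), "c"])

# Collapse every flavor of missing value (None, NaN, ...) into one
# sentinel label so 'missing' behaves like a single ordinary category.
real_categories = set(real.fillna("__missing__"))
synth = synthetic.fillna("__missing__")

adheres = synth.isin(real_categories)
print(adheres.tolist())  # [True, True, True, False]
```

Because the real data contains a missing value, the missing values in the synthetic data count as valid; only `"c"` is flagged as a new, invalid category.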

Score

  • (best) 1.0: All category values in the synthetic data were present in the real data

  • (worst) 0.0: None of the category values in the synthetic data were present in the real data

Any score in between tells us the proportion of data points that adhere to the correct values. For example, 0.6 means that 60% of synthetic data points have a value present in the real data, while the remaining 40% contain new values that were never present in the real data.

How does it work?

This metric extracts the set of unique categories that are present in the real column, Cr.

Then it counts the data points of the synthetic data, s, whose values are found in the set Cr. The score is the proportion of these data points relative to all the synthetic data points.

$$score = \frac{|\{\, s : s \in C_r \,\}|}{|s|}$$
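The formula above can be read off directly as a few lines of pandas. This is an illustrative reimplementation for clarity, not the library's internal code; the `"__missing__"` sentinel is an assumption used to mirror the documented missing-value behavior.

```python
import pandas as pd

def category_adherence(real_data: pd.Series, synthetic_data: pd.Series) -> float:
    """Proportion of synthetic points whose category appears in the real data.

    Illustrative sketch of the score formula; missing values are collapsed
    into a single category, matching the documented behavior.
    """
    real_categories = set(real_data.fillna("__missing__"))   # C_r
    synthetic = synthetic_data.fillna("__missing__")         # s
    return synthetic.isin(real_categories).mean()            # |{s : s in C_r}| / |s|

real = pd.Series(["cat", "dog", "cat", "bird"])
synthetic = pd.Series(["cat", "dog", "fish", "cat", "dog"])
print(category_adherence(real, synthetic))  # 0.8 — 'fish' never appears in the real data
```

Here 4 of the 5 synthetic values are categories seen in the real column, so the score is 0.8.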

Usage

To manually apply this metric, access the single_column module and use the compute method.

from sdmetrics.single_column import CategoryAdherence

CategoryAdherence.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

  • (required) real_data: A pandas.Series object with the column of real data

  • (required) synthetic_data: A pandas.Series object with the column of synthetic data

FAQs

Is there an equivalent metric for continuous data?

For continuous data, many values are possible, so exact category matching does not apply. Use the BoundaryAdherence metric to ensure the values are within the correct min/max bounds.

Does this metric measure quality or data coverage?

No. This metric is a measure of validity, as we generally consider discrete data to be valid only if it contains the correct category values.

Data quality refers to the frequency of each particular category value. To compare this, use the TVComplement metric.

Data coverage refers to the idea that the synthetic data should cover at least 1 of each category value. To measure this, use the CategoryCoverage metric.

Recommended Usage: The Diagnostic Report applies this metric to all applicable columns.
