RangeCoverage


Last updated 1 year ago

This metric measures whether a synthetic column covers the full range of values that are present in a real column.

Data Compatibility

  • Numerical: This metric is meant for continuous, numerical data

  • Datetime: This metric converts datetime values into numerical values

This metric ignores missing values.
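
As a rough illustration of the two points above, the sketch below drops missing entries and maps datetime values to numbers before any range is computed. The helper name `to_numeric` is hypothetical, and using POSIX timestamps as the numerical representation is an assumption made for this example; the exact conversion used inside SDMetrics is not described here.

```python
import math
from datetime import datetime, timezone


def to_numeric(values):
    """Hypothetical helper: drop missing entries and return floats.

    Datetimes are mapped to POSIX timestamps (an assumption for this sketch).
    """
    cleaned = []
    for value in values:
        # Skip missing values (None or NaN), mirroring "ignores missing values".
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        # Convert datetimes to seconds since the Unix epoch.
        if isinstance(value, datetime):
            value = value.timestamp()
        cleaned.append(float(value))
    return cleaned


real_dates = [
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    None,  # missing entry, dropped before computing the range
    datetime(2020, 12, 31, tzinfo=timezone.utc),
]
numeric_dates = to_numeric(real_dates)
```

After this step the min and max of a datetime column can be compared in the same way as for any other numerical column.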

Score

  • (best) 1.0: The synthetic column covers the range of values present in the real column

  • (worst) 0.0: The synthetic column does not overlap at all with the range of values in the real column

The plot below shows some fictitious real and synthetic data (black and green respectively) with RangeCoverage=0.82.

How does it work?

If r and s represent the real and synthetic columns, then this metric computes how close the min and max values of s come to the true min and max values in r according to the formula below.

$$
score = 1 - \left[\max\left(\frac{\min(s)-\min(r)}{\max(r)-\min(r)}, 0\right) + \max\left(\frac{\max(r)-\max(s)}{\max(r)-\min(r)}, 0\right)\right]
$$

If the synthetic data has extremely poor range coverage, the equation above may become negative. In this case, we report a score of 0 since it is the lowest possible value.

Note that the score isn't penalized if the synthetic data goes out of bounds. If the synthetic data reaches beyond the real min and max, the range is fully covered and the score will be 1.
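
The formula and the two notes above can be sketched in plain Python as follows. This is a minimal illustration rather than the SDMetrics implementation: the function name `range_coverage` is hypothetical, the inputs are assumed to be non-empty sequences of numbers with missing values already removed, and returning NaN when the real column has a single constant value is an assumption made here to avoid dividing by zero.

```python
def range_coverage(real, synthetic):
    """Sketch of the RangeCoverage formula for two numeric sequences."""
    min_r, max_r = min(real), max(real)
    min_s, max_s = min(synthetic), max(synthetic)

    span = max_r - min_r
    if span == 0:
        # Assumption for this sketch: the score is undefined for a constant column.
        return float("nan")

    # Fraction of the real range left uncovered at the low and high ends.
    # Clamping at 0 means out-of-bounds synthetic values are not penalized.
    low_gap = max((min_s - min_r) / span, 0)
    high_gap = max((max_r - max_s) / span, 0)

    # Clamp the final score at 0 for extremely poor coverage.
    return max(1 - (low_gap + high_gap), 0)


partial = range_coverage([37.0, 97.7], [45, 95])          # both ends partly missing
out_of_bounds = range_coverage([37.0, 97.7], [30, 100])   # 1: fully covered
no_overlap = range_coverage([0, 10], [20, 30])            # 0: clamped from a negative value
```

The second and third calls exercise the two special cases: synthetic data that extends past the real bounds still scores 1, and synthetic data that misses the real range entirely is reported as 0 rather than as a negative number.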

Usage

To manually apply this metric, access the single_column module and use the compute method.

from sdmetrics.single_column import RangeCoverage

RangeCoverage.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

  • (required) real_data: A pandas.Series object with the column of real data

  • (required) synthetic_data: A pandas.Series object with the column of synthetic data

FAQs

Is there an equivalent metric for discrete data?

Use the CategoryCoverage metric with discrete data, for example categorical or boolean values that don't span a continuous range.

What if the synthetic data is going out of bounds?

If the synthetic data is going out of bounds (the min is less than the real min or the max is greater than the real max), then this metric considers that part of the range covered.

If you'd like to quantify when the synthetic data is going out of bounds, use the BoundaryAdherence metric.

Description of the plot in the Score section: The real data is in the range [37.0, 97.7], while the synthetic data is in the range [45, 95]. The synthetic data fails to cover the lower and upper ends of the real values: [37, 45] and [95, 97.7]. The missing ranges account for roughly 18% of the overall range, making the RangeCoverage 0.82.
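
For reference, substituting these example ranges into the score formula gives the following arithmetic (values rounded to three decimals):

$$
score = 1 - \left[\max\left(\frac{45 - 37}{97.7 - 37}, 0\right) + \max\left(\frac{97.7 - 95}{97.7 - 37}, 0\right)\right] = 1 - \frac{8 + 2.7}{60.7} \approx 1 - 0.176 = 0.824
$$

which is approximately 0.82.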