GMLikelihood

Data Likelihood describes a set of metrics that calculate the likelihood of the synthetic data belonging to the real data. This metric uses Gaussian Mixture Models to make this calculation.

Data Compatibility

  • Numerical: This metric is meant for continuous, numerical data

This metric ignores any incompatible column types.

This metric does not accept missing values
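Because the metric rejects missing values, it can help to clean both tables before scoring. The following is a minimal sketch using pandas. The helper name `prepare_for_gm` and the choice to drop (rather than impute) incomplete rows are assumptions made for illustration and are not part of SDMetrics. The metric already ignores incompatible columns, so the column filter is only there to limit the missing-value check to columns that would actually be scored.

```python
import pandas as pd


def prepare_for_gm(table: pd.DataFrame) -> pd.DataFrame:
    """Keep numerical columns and drop rows that contain missing values.

    Hypothetical helper for illustration; it is not part of SDMetrics.
    """
    numerical = table.select_dtypes(include="number")
    return numerical.dropna().reset_index(drop=True)


# Toy table with one non-numerical column and one missing value.
example = pd.DataFrame({
    "age": [34.0, None, 51.0],
    "income": [52000.0, 61000.0, 48000.0],
    "city": ["Boston", "Austin", "Denver"],
})
prepared = prepare_for_gm(example)
```

Applying the same helper to the real and synthetic tables keeps their columns aligned. Whether dropping rows is acceptable depends on how much data is lost.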

Score

(highest) ∞: According to the algorithm, the synthetic data has the highest possible likelihood of belonging to the real data

(lowest) -∞: According to the algorithm used, the synthetic data has the lowest possible likelihood of belonging to the real data

There are multiple interpretations of the score. A high score can indicate high synthetic data quality as well as low privacy. A low score can indicate low synthetic data quality as well as high privacy.

How does it work?

This metric fits multiple Gaussian mixture models [1] to learn the distribution of the real data. The model learns to produce a likelihood estimate for every row ranging from -∞ to +∞, where -∞ means the row is likely not part of the data and +∞ means that it is.

We apply the model to all the synthetic data and return the average likelihood score.
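The averaging step can be illustrated with a small, dependency-free sketch. It assumes the per-row estimate is a log-likelihood (which is consistent with scores that can be negative) and that each mixture component has a diagonal covariance. The mixture parameters below are hand-picked stand-ins for a model fitted on real data, and all function names are hypothetical. The fitting step itself is omitted.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def mixture_log_likelihood(row, weights, means, stds):
    """Log-likelihood of one row under a diagonal-covariance Gaussian mixture."""
    component_logs = []
    for weight, comp_means, comp_stds in zip(weights, means, stds):
        # With a diagonal covariance, columns are independent within a component.
        log_density = sum(
            gaussian_log_pdf(x, m, s) for x, m, s in zip(row, comp_means, comp_stds)
        )
        component_logs.append(math.log(weight) + log_density)
    # log-sum-exp keeps the sum over components numerically stable.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(v - peak) for v in component_logs))


def average_log_likelihood(rows, weights, means, stds):
    """Average the per-row log-likelihoods over all rows."""
    scores = [mixture_log_likelihood(row, weights, means, stds) for row in rows]
    return sum(scores) / len(scores)


# Hand-picked parameters standing in for a mixture fitted on real data.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
stds = [[1.0, 1.0], [1.0, 1.0]]

close_rows = [[0.1, -0.2], [4.8, 5.1]]    # near the mixture components
far_rows = [[20.0, -20.0], [30.0, 30.0]]  # far from both components

close_score = average_log_likelihood(close_rows, weights, means, stds)
far_score = average_log_likelihood(far_rows, weights, means, stds)
```

Rows that sit near a component receive a higher average than rows far from every component, which is the behavior the score is meant to capture.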

Usage

Access this metric from the single_table module and use the compute method.

from sdmetrics.single_table import GMLikelihood

GMLikelihood.compute(
    real_data=real_table,
    synthetic_data=synthetic_table
)
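The call above relies on the defaults. The optional n_components and covariance_type parameters documented on this page control which mixture models are considered. As a rough sketch of how such values could expand into a list of candidates, the hypothetical helper below enumerates every combination. It assumes the (low, high) bounds are inclusive, which this page does not state explicitly, and it is not part of the SDMetrics API.

```python
import itertools


def candidate_models(n_components=(1, 30), covariance_type="diag"):
    """List the (n_components, covariance_type) pairs a search could try.

    Hypothetical helper; assumes the (low, high) bounds are inclusive.
    """
    if isinstance(n_components, int):
        low = high = n_components
    else:
        low, high = n_components
    if isinstance(covariance_type, str):
        covariance_type = (covariance_type,)
    return list(itertools.product(range(low, high + 1), covariance_type))


default_grid = candidate_models()
fixed_grid = candidate_models(n_components=5)
wide_grid = candidate_models(n_components=(2, 4), covariance_type=("diag", "full"))
```

A single integer yields exactly one candidate, so no search is needed in that case. A wider range or several covariance types increases the number of models that have to be fitted.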

Parameters

  • (required) real_data: A pandas.DataFrame containing the real data

  • (required) synthetic_data: A pandas.DataFrame containing the same columns of synthetic data

  • metadata: A metadata dictionary describing the columns (see Single Table Metadata)

  • n_components: Number of components to use for the mixture model

    • (default) (1, 30): Search for the optimal number of components between 1 and 30

    • (<low integer>, <high integer>): Search for the optimal number of components between the low and high integer

    • <integer>: Use exactly the integer number of components provided

  • covariance_type: A string describing the covariance type to use for the mixture models. If multiple values are passed, the best one is searched for. Defaults to 'diag'. See the sklearn API for other possible values.

  • iterations: Number of times that each number of components should be evaluated before averaging the scores. Defaults to 3.

  • retries: Number of times that each iteration will be retried if the mixture model crashes during fit. Defaults to 3.

FAQs

This metric is in Beta. Be careful when using the metric and interpreting its score.

  • There are multiple interpretations for this metric (see the Score section above). Of course, this is heavily dependent on how well we trust the algorithm to model the real data.

  • The score heavily depends on the algorithm used to model the data. If the overall distribution of the real data cannot be learned well, then the likelihood estimates of the synthetic data may not be valid.

References

[1] https://en.wikipedia.org/wiki/Mixture_model

Last updated 2 months ago