Quality

Quality metrics capture the statistical similarity between real data and synthetic data. If the synthetic and real data are statistically similar, we refer to the synthetic data as being high quality. We intend the quality metrics to be aspirational, as it may not always be possible to achieve 100% quality on all metrics.

Synthetic data can be measured in two ways. Much of the focus has been on measuring statistical differences between the real and synthetic data, which is what the quality measures do. But this is not enough. Synthetic data needs to provide a return on investment (ROI) for the task it is ultimately meant to accomplish, whether that is software testing, machine learning development, or something else. When possible, include metrics that measure ROI in your evaluation.

SDMetrics includes metrics for statistical data differences as well as for the ultimate ROI for different tasks. The two may or may not correlate.

Quality Report

Measure the quality of your entire dataset. The Quality Report is designed to capture quality measurements across multiple tables and columns. It chooses which metrics to apply based on each column's type and combines the results into a consolidated score.

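from sdmetrics.reports.single_table import QualityReport

# real_data and synthetic_data are pandas DataFrames with the same columns;
# metadata is a dictionary describing the table's columns and their types.
report = QualityReport()
report.generate(real_data, synthetic_data, metadata)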

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 9/9 [00:00<00:00, 273.13it/s]|
Column Shapes Score: 89.11%

(2/2) Evaluating Column Pair Trends: |██████████| 36/36 [00:00<00:00, 57.42it/s]|
Column Pair Trends Score: 88.3%

Overall Score (Average): 88.7%
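
The "(Average)" label suggests the overall score is the plain mean of the two property scores. A minimal sketch of that arithmetic, using the rounded percentages printed in the output above; the variable names are illustrative and not part of SDMetrics, and the unweighted mean is an assumption based on the label:

```python
# Property scores as printed in the report output above (already rounded).
property_scores = {
    "Column Shapes": 89.11,
    "Column Pair Trends": 88.3,
}

# Assumption: the overall score is the unweighted mean of the property
# scores, as the "(Average)" label in the output suggests.
overall = sum(property_scores.values()) / len(property_scores)

print(f"Overall Score (Average): {overall:.1f}%")
```

Because the printed property scores are themselves rounded, recomputing the average this way may differ from the report's own value in the last digit.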

Browse

Alternatively, you can apply quality metrics to individual columns and tables in your data:

KSComplement, TVComplement: compare column shapes (aka marginal distributions, histograms)
ContingencySimilarity, CorrelationSimilarity: compare 2D distributions & pairwise correlations
CardinalityShapeSimilarity: compare the frequency of parent/child connections (aka cardinality)
CategoryCoverage, RangeCoverage: measure whether the overall synthetic data spans all the possibilities
SequenceLengthSimilarity, StatisticMSAS: compare real and synthetic data that represents sequential information
MissingValueSimilarity, StatisticSimilarity: compare individual statistics of the data
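
To make the ideas behind a few of these metrics concrete, here is a rough pure-Python sketch of three of them. It illustrates the underlying statistics only and is not SDMetrics' implementation: the function names are hypothetical, and the formulas are the textbook readings of the metric names (one minus the total variation distance, one minus the two-sample Kolmogorov-Smirnov statistic, and the share of real categories that appear in the synthetic data). Each function returns a score between 0 and 1, where 1 means the synthetic column matches the real one on that measure.

```python
from collections import Counter


def tv_complement(real, synthetic):
    """1 - total variation distance between the category frequencies."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories missing from one of the columns.
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


def ks_complement(real, synthetic):
    """1 - the largest gap between the two empirical CDFs."""
    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(real) | set(synthetic))
    ks_stat = max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in points)
    return 1.0 - ks_stat


def category_coverage(real, synthetic):
    """Share of the real categories that also occur in the synthetic data."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)


# Toy columns, made up for illustration.
real_colors = ["red", "red", "blue", "green"]
synthetic_colors = ["red", "blue", "blue", "blue"]
real_ages = [20, 30, 40, 50]
synthetic_ages = [20, 30, 30, 40]

print(tv_complement(real_colors, synthetic_colors))      # 0.5
print(ks_complement(real_ages, synthetic_ages))          # 0.75
print(category_coverage(real_colors, synthetic_colors))  # 2 of 3 categories covered
```

In the toy data, the synthetic colors never include "green", which lowers both the category coverage and the frequency-based score, while the synthetic ages never reach 50, which shows up as a gap between the two cumulative distributions.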

