Data Quality

The Quality Report checks for statistical similarity between the real and the synthetic data. Use this to discover which patterns the synthetic data has captured from the real data.

Usage

Run a quality report to receive a score and a corresponding report.

evaluate_quality

Use this function to evaluate the quality of the synthetic data.

from sdv.evaluation.multi_table import evaluate_quality

quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Generating report ...

(1/4) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 564.15it/s]|
Column Shapes Score: 85.61%

(2/4) Evaluating Column Pair Trends: |██████████| 55/55 [00:00<00:00, 110.40it/s]|
Column Pair Trends Score: 71.97%

(3/4) Evaluating Cardinality: |██████████| 1/1 [00:00<00:00, 53.27it/s]|
Cardinality Score: 70.0%

(4/4) Evaluating Intertable Trends: |██████████| 50/50 [00:00<00:00, 86.54it/s]|
Intertable Trends Score: 68.49%

Overall Score (Average): 74.02%
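The overall score is the plain average of the four property scores. A quick sanity check using the numbers from the sample output above (this assumes equal weighting across properties, as the "(Average)" label suggests):

```python
# Property scores taken from the sample report output above.
property_scores = {
    "Column Shapes": 85.61,
    "Column Pair Trends": 71.97,
    "Cardinality": 70.0,
    "Intertable Trends": 68.49,
}

# The overall quality score is the average of the property scores.
overall = sum(property_scores.values()) / len(property_scores)
print(f"Overall Score (Average): {overall:.2f}%")  # Overall Score (Average): 74.02%
```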

Parameters:

  • (required) real_data: A dictionary that maps each table name to a pandas.DataFrame containing the real data

  • (required) synthetic_data: A dictionary that maps each table name to a pandas.DataFrame containing the synthetic data

  • (required) metadata: A MultiTableMetadata object with your metadata

  • verbose: A boolean describing whether or not to print the report progress and results. Defaults to True. Set this to False to run the report silently.

Returns: An SDMetrics QualityReport object generated with your real and synthetic data

Interpreting the Score

Your score will vary from 0% to 100%. This value tells you how similar the synthetic data is to the real data.

  • A 100% score means that the patterns are exactly the same. For example, if you compared the real data with itself (identity), the score would be 100%.

  • A 0% score means the patterns are as different as possible. This would mean the synthetic data purposefully contains anti-patterns that are the opposite of those in the real data.

  • Any score in the middle can be interpreted along this scale. For example, a score of 80% means that the synthetic data is about 80% similar to the real data — about 80% of the trends are similar.

The quality score is expected to vary, and you may never achieve exactly 100% quality. That's ok! The SDV synthesizers are designed to estimate patterns, meaning that they may smooth, extrapolate, or add noise to certain parts of the data. For more information, see the FAQs.
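As a rough reading aid, the scale above can be expressed as a small helper. This is illustrative only; interpret_quality_score is a hypothetical function, not part of the SDV API, and the endpoint descriptions simply restate the bullets above:

```python
def interpret_quality_score(score: float) -> str:
    """Map a 0-100 quality score onto the rough interpretation described above.

    Illustrative only -- a reading aid, not an SDV rule or API.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("Quality scores range from 0% to 100%.")
    if score == 100.0:
        return "Patterns are exactly the same (e.g. comparing the real data with itself)."
    if score == 0.0:
        return "Patterns are as different as possible (anti-patterns)."
    return f"About {score:.0f}% of the trends in the real data are captured."

print(interpret_quality_score(80.0))
```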

What's Included?

The different types of data quality are summarized below.

Column Shapes

The statistical similarity between the real and synthetic data for single columns of data. This is often called the marginal distribution of each column.

Column Pair Trends

The statistical similarity between the real and synthetic data for pairs of columns (within the same table). This is often called the correlation or bivariate distributions of the columns.

Cardinality

Within each parent/child relationship, the cardinality refers to the number of children that each parent has.

Intertable Trends

This is similar to Column Pair Trends, but refers to pairs of columns in different tables. For example, a column in a parent table compared with a column in a child table.
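To make the Cardinality property concrete, here is a small pandas sketch with toy data (not SDV code): it counts how many child rows each parent has, which is the distribution the report compares between the real and synthetic tables.

```python
import pandas as pd

# Toy child tables linked to a parent table by hotel_id (names are invented).
real_guests = pd.DataFrame({"hotel_id": ["H1", "H1", "H1", "H2"]})
synthetic_guests = pd.DataFrame({"hotel_id": ["H1", "H1", "H2", "H2"]})

# Cardinality: the number of child rows per parent row.
real_cardinality = real_guests["hotel_id"].value_counts().sort_index()
synthetic_cardinality = synthetic_guests["hotel_id"].value_counts().sort_index()

print(real_cardinality.to_dict())       # {'H1': 3, 'H2': 1}
print(synthetic_cardinality.to_dict())  # {'H1': 2, 'H2': 2}
```

The Cardinality property scores how closely these two per-parent distributions match.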

get_details

This function returns details about the report's properties. Use it to pinpoint the exact columns or tables that are causing issues.

Parameters:

  • (required) property_name: A string with the name of the property. One of: 'Column Shapes', 'Column Pair Trends', 'Cardinality' or 'Intertable Trends'.

  • table_name: A string with the name of the table. If provided, you'll receive filtered results for the table.

Returns: A pandas.DataFrame object with the detailed scores

quality_report.get_details(property_name='Column Shapes', table_name='guests')
Table        Column            Metric             Score
guests       amenities_fee     KSComplement       0.921127
guests       checkin_date      KSComplement       0.926000
...    
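Because get_details returns a regular pandas.DataFrame, you can sort or filter it to surface problem columns. A sketch using sample rows shaped like the output above (the scores here are invented for illustration, not real report output):

```python
import pandas as pd

# Sample rows shaped like quality_report.get_details(...) output.
# Scores are made up for illustration.
details = pd.DataFrame({
    "Table": ["guests", "guests", "guests"],
    "Column": ["amenities_fee", "checkin_date", "room_rate"],
    "Metric": ["KSComplement", "KSComplement", "KSComplement"],
    "Score": [0.921127, 0.926000, 0.534210],
})

# Surface the lowest-scoring columns first.
worst = details.sort_values("Score").head(2)
print(worst[["Table", "Column", "Score"]].to_string(index=False))
```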

save

Use this function to save the report object.

The report does not save the full real and synthetic datasets, but it does save the metadata along with the score for each property, breakdown and metric. We still recommend using caution when deciding where to store the report and whom to share it with.

Parameters:

  • (required) filepath: The name of the file to save the object to. This must end with .pkl

Returns: None. The report is saved as a file.

quality_report.save(filepath='results/quality_report.pkl')

QualityReport.load

Use this function to load in a previously-saved quality report.

Parameters:

  • (required) filepath: The name of the file where the report is stored

Returns: An SDMetrics QualityReport object

from sdmetrics.reports.multi_table import QualityReport

quality_report = QualityReport.load('results/quality_report.pkl')

FAQs

See the SDMetrics QualityReport for even more details about the metrics and properties included in the report.

What can I do to improve the quality score?

We recommend using the report to get more detailed insight.

  1. Identify which properties have low scores.

  2. Use the get_details method for those properties to identify which particular data columns or tables have the lowest scores.

  3. If possible, visualize the data to see how the synthetic data compares to the real data.

Using this information, you can update parameters or the data processing steps for the relevant columns. Refer to the API docs corresponding to your synthesizer and check the available customizations.
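The triage steps above can be sketched with pandas, since the report's summaries come back as DataFrames. The property scores below are invented for illustration, and this assumes the report exposes a get_properties method (part of the SDMetrics QualityReport API) returning 'Property' and 'Score' columns:

```python
import pandas as pd

# Step 1: find the property with the lowest score.
# Shaped like quality_report.get_properties(); values invented for illustration.
properties = pd.DataFrame({
    "Property": ["Column Shapes", "Column Pair Trends",
                 "Cardinality", "Intertable Trends"],
    "Score": [0.8561, 0.7197, 0.7000, 0.6849],
})

weakest = properties.loc[properties["Score"].idxmin(), "Property"]
print(weakest)  # Intertable Trends

# Step 2: pass that property name to
# quality_report.get_details(property_name=weakest) and sort the resulting
# DataFrame by 'Score' to find the worst-scoring columns or tables.
# Step 3: visualize those columns to compare real vs. synthetic data.
```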

Note that it's ok — and even expected — to have a quality score that is not exactly 100%. Many of our users find that the synthetic data is still effective for downstream use.

If my score is very high, does that mean the synthetic data will have high utility?

A high score indicates a high level of statistical similarity between the real and the synthetic data in terms of the properties we've tested (column shapes, column pair trends, cardinality and intertable trends). This is a proxy for the overall utility the synthetic data may have for your project, but it is not a guarantee.

The only way to capture true data utility is to use your synthetic data for its intended purpose (downstream application). We recommend trying this as soon as possible, iterating to improve your synthetic data.

If you need help with this, please contact us via GitHub or Slack.

This report checks for patterns in 1 and 2 dimensions. Why not higher dimensions?

Higher order distributions of 3 or more columns are not included in the Quality Report. We have found that very high order similarity may have an adverse effect on the synthetic data. After a certain point, it indicates that the synthetic data is just a copy of the real data. (For more information, see the NewRowSynthesis metric.)

If higher order similarity is a requirement, you likely have a targeted use case for synthetic data (e.g. machine learning efficacy). Until we add these reports, you may want to explore other metrics in the SDMetrics library. You may also want to try directly using your synthetic data for the downstream application.


Copyright (c) 2023, DataCebo, Inc.