Data Quality

The Quality Report checks for statistical similarity between the real and the synthetic data. Use this to discover which patterns the synthetic data has captured from the real data.

Usage

Run a quality report to receive a score and a corresponding report.

evaluate_quality

Use this function to evaluate the quality of the synthetic data.

from sdv.evaluation.multi_table import evaluate_quality

quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Generating report ...
(1/4) Evaluating Column Shapes: : 100%|██████████| 13/13 [00:00<00:00, 338.47it/s]
(2/4) Evaluating Column Pair Trends: : 100%|██████████| 22/22 [00:00<00:00, 95.98it/s]
(3/4) Evaluating Cardinality: : 100%|██████████| 2/2 [00:00<00:00, 69.99it/s]
(4/4) Evaluating Intertable Trends: : 100%|██████████| 36/36 [00:00<00:00, 111.46it/s]

Overall Quality Score: 62.49%

Properties:
- Column Shapes: 79.23%
- Column Pair Trends: 42.5%
- Cardinality: 80.0%
- Intertable Trends: 48.24%
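
In the output above, the overall score matches the unweighted average of the four property scores. A minimal check using the printed values (the variable names are illustrative and not part of the SDV API):

```python
# Property scores copied from the printed report above (as percentages).
property_scores = {
    'Column Shapes': 79.23,
    'Column Pair Trends': 42.5,
    'Cardinality': 80.0,
    'Intertable Trends': 48.24,
}

# The overall score printed above equals the unweighted mean of the properties.
overall_score = sum(property_scores.values()) / len(property_scores)
print(f'Overall Quality Score: {overall_score:.2f}%')  # Overall Quality Score: 62.49%
```

Because every property carries the same weight, a single weak property (here Column Pair Trends) can pull the overall score down noticeably.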

Parameters:

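  • (required) real_data: A dictionary mapping each table name to a pandas.DataFrame containing the real data for that table

  • (required) synthetic_data: A dictionary mapping each table name to a pandas.DataFrame containing the synthetic data for that table

  • (required) metadata: The metadata object that describes the tables, as passed in the call above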

  • verbose: A boolean describing whether or not to print the report progress and results. Defaults to True. Set this to False to run the report silently.

Interpreting the Score

What's Included?

The different types of data quality are summarized below.

  • Column Shapes: The statistical similarity between the real and synthetic data for single columns of data. This is often called the marginal distribution of each column.

  • Column Pair Trends: The statistical similarity between the real and synthetic data for pairs of columns (within the same table). This is often called the correlation or bivariate distribution of the columns.

  • Cardinality: Within each parent/child relationship, the cardinality refers to the number of children that each parent has.

  • Intertable Trends: This is similar to column pair trends, but instead refers to pairs of columns in different tables. For example, one column in a parent table and another column in a child table.
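
To make the cardinality property concrete, the sketch below counts how many child rows reference each parent. The table and column names (hotels, guests, hotel_id) and the rows are made up for illustration; the report derives the relationships from the metadata.

```python
from collections import Counter

# Hypothetical parent primary keys and child foreign keys (illustrative data only).
hotel_ids = ['H1', 'H2', 'H3']
guest_hotel_ids = ['H1', 'H1', 'H2', 'H1', 'H2']

# Cardinality: number of child rows per parent, including parents with no children.
child_counts = Counter(guest_hotel_ids)
cardinality = {hotel_id: child_counts.get(hotel_id, 0) for hotel_id in hotel_ids}
print(cardinality)  # {'H1': 3, 'H2': 2, 'H3': 0}
```

Roughly speaking, the Cardinality property compares how these per-parent counts are distributed in the real data against the synthetic data.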

get_details

This function returns details about the report's properties. Use it to pinpoint the exact columns or tables that are causing issues.

Parameters:

  • (required) property_name: A string with the name of the property. One of: 'Column Shapes', 'Column Pair Trends', 'Cardinality' or 'Intertable Trends'.

  • table_name: A string with the name of the table. If provided, you'll receive filtered results for the table.

Returns: A pandas.DataFrame object with the detailed scores

quality_report.get_details(property_name='Column Shapes', table_name='guests')
Table        Column            Metric             Score
guests       amenities_fee     KSComplement       0.921127
guests       checkin_date      KSComplement       0.926000
...    
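
The KSComplement metric shown in the output above is one minus the two-sample Kolmogorov-Smirnov statistic, so a score of 1.0 means the two empirical distributions coincide. The sketch below is a plain-Python illustration with made-up values; it is not the SDMetrics implementation, and the function name is an assumption.

```python
from bisect import bisect_right

def ks_complement(real_values, synthetic_values):
    """Return 1 minus the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real_values)
    synthetic_sorted = sorted(synthetic_values)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in set(real_sorted) | set(synthetic_sorted):
        real_cdf = bisect_right(real_sorted, x) / len(real_sorted)
        synthetic_cdf = bisect_right(synthetic_sorted, x) / len(synthetic_sorted)
        max_gap = max(max_gap, abs(real_cdf - synthetic_cdf))
    return 1.0 - max_gap

# Illustrative values only.
real_fees = [10.0, 12.0, 15.0, 20.0]
synthetic_fees = [11.0, 12.0, 16.0, 30.0]
print(ks_complement(real_fees, real_fees))       # 1.0
print(ks_complement(real_fees, synthetic_fees))  # 0.75
```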

save

Use this function to save the report object.

Parameters:

  • (required) filepath: The name of the file in which to save the object. This must end with .pkl

Returns: (None) Saves the report as a file

quality_report.save(filepath='results/quality_report.pkl')
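
This page does not say whether save creates missing folders, so a cautious approach is to create the target folder first. The helper below is a hypothetical wrapper (save_report is not part of SDV); it only assumes the report object exposes the save(filepath=...) method shown above.

```python
from pathlib import Path

def save_report(report, filepath):
    """Save a report to a .pkl file, creating the parent folder first."""
    path = Path(filepath)
    if path.suffix != '.pkl':
        raise ValueError(f"filepath must end with .pkl, got '{filepath}'")
    # Create e.g. 'results/' if it is missing; harmless if it already exists.
    path.parent.mkdir(parents=True, exist_ok=True)
    report.save(filepath=str(path))
    return path
```

With the report from earlier, this would be called as save_report(quality_report, 'results/quality_report.pkl').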

QualityReport.load

Use this function to load a previously saved quality report.

Parameters:

  • (required) filepath: The name of the file where the report is stored

from sdmetrics.reports.multi_table import QualityReport

quality_report = QualityReport.load('results/quality_report.pkl')

FAQs

What can I do to improve the quality score?

We recommend using the report to get more detailed insight.

  1. Identify which properties have a low score.

Note that it's OK, and even expected, to have a quality score that is not exactly 100%. Many of our users find that the synthetic data is still effective for downstream use.
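
The first step above can be sketched by ranking the property scores from the example report output. The dictionary below is typed in by hand from that output; it is an illustration, not something the report returns in this form.

```python
# Property scores copied from the example report output (as percentages).
property_scores = {
    'Column Shapes': 79.23,
    'Column Pair Trends': 42.5,
    'Cardinality': 80.0,
    'Intertable Trends': 48.24,
}

# List properties from lowest to highest score; investigate the first ones first.
ranked = sorted(property_scores.items(), key=lambda item: item[1])
for name, score in ranked:
    print(f'{name}: {score}%')

lowest_property = ranked[0][0]
```

The lowest-scoring name can then be passed as property_name to get_details to see which columns or tables are responsible.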

If my score is very high, does that mean the synthetic data will have high utility?

A high score indicates a high level of statistical similarity between the real and the synthetic data in terms of the properties we've tested (column shapes, column pair trends, cardinality and intertable trends). This is a proxy for the overall utility the synthetic data may have for your project, but it is not a guarantee.

The only way to capture true data utility is to use your synthetic data for its intended purpose (downstream application). We recommend trying this as soon as possible, iterating to improve your synthetic data.

This report checks for patterns in 1 and 2 dimensions. Why not higher dimensions?

Higher-order distributions of 3 or more columns are not included in the Quality Report. We have found that very high-order similarity may have an adverse effect on the synthetic data. After a certain point, it indicates that the synthetic data is just a copy of the real data. (For more information, see the NewRowSynthesis metric.)
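
As a rough illustration of the copying concern, the sketch below measures the fraction of synthetic rows that do not appear verbatim in the real data. It is a simplification that uses made-up rows and exact matching only; it is not the NewRowSynthesis implementation, and the function name is an assumption.

```python
def new_row_fraction(real_rows, synthetic_rows):
    """Fraction of synthetic rows that do not exactly match any real row."""
    if not synthetic_rows:
        raise ValueError('synthetic_rows must not be empty')
    real_set = set(real_rows)
    new_rows = [row for row in synthetic_rows if row not in real_set]
    return len(new_rows) / len(synthetic_rows)

# Illustrative rows as (room_type, amenities_fee) tuples.
real_rows = [('BASIC', 10.0), ('DELUXE', 25.0), ('SUITE', 40.0)]
synthetic_rows = [('BASIC', 10.0), ('BASIC', 12.5), ('SUITE', 38.0), ('DELUXE', 25.0)]
print(new_row_fraction(real_rows, synthetic_rows))  # 0.5
```

A value near 0 would suggest that the synthetic data largely repeats real rows, which is the situation described above.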
