Diagnostic

The Diagnostic Report runs some basic checks for data format and validity. Run this to ensure that you have created valid synthetic data.

New and improved! Starting from SDV version 1.8.0, you'll see a new diagnostic intended to find problems with the synthetic data. You will notice some key improvements to the report and its interpretation.

Usage

Run the diagnostic to receive a score and a corresponding report.

run_diagnostic

Use this function to run a diagnostic on the synthetic data.

from sdv.evaluation.multi_table import run_diagnostic

diagnostic_report = run_diagnostic(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Generating report ...

(1/3) Evaluating Data Validity: |██████████| 15/15 [00:00<00:00, 603.69it/s]|
Data Validity Score: 100.0%

(2/3) Evaluating Data Structure: |██████████| 2/2 [00:00<00:00, 151.49it/s]|
Data Structure Score: 100.0%

(3/3) Evaluating Relationship Validity: |██████████| 1/1 [00:00<00:00, 68.51it/s]|
Relationship Validity Score: 100.0%

Overall Score (Average): 100.0%

Parameters:

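  • (required) real_data: A dictionary mapping each table name to a pandas.DataFrame containing the real data for that table

  • (required) synthetic_data: A dictionary mapping each table name to a pandas.DataFrame containing the synthetic data for that table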

  • (required) metadata: A MultiTableMetadata object with your metadata

  • verbose: A boolean describing whether or not to print the report progress and results. Defaults to True. Set this to False to run the report silently.

Returns: An SDMetrics DiagnosticReport object generated with your real and synthetic data
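
As a rough sketch of a silent run, the hypothetical helper below (not part of SDV) assumes that real_data and synthetic_data are dictionaries keyed by table name, as is usual for multi-table data. It first checks that both contain the same tables and then calls run_diagnostic with verbose=False. The SDV import sits inside the function so that the sketch can be loaded without SDV installed.

```python
def check_same_tables(real_data, synthetic_data):
    """Raise ValueError if the two dictionaries do not hold the same table names."""
    missing = set(real_data) - set(synthetic_data)
    extra = set(synthetic_data) - set(real_data)
    if missing or extra:
        raise ValueError(
            f"Table mismatch. Missing from synthetic: {sorted(missing)}; "
            f"unexpected in synthetic: {sorted(extra)}"
        )


def run_silent_diagnostic(real_data, synthetic_data, metadata):
    """Hypothetical wrapper: validate the inputs, then run the diagnostic quietly."""
    check_same_tables(real_data, synthetic_data)

    # Imported lazily so this module loads even where SDV is not installed.
    from sdv.evaluation.multi_table import run_diagnostic

    return run_diagnostic(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
        verbose=False,  # suppress the progress bars and printed scores
    )
```

Checking the table names up front is a design choice of this sketch. It turns a likely input mistake into a clear error before any report is generated.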

Interpreting the Score

The score should be 100%. The diagnostic report checks for basic data validity and data structure issues. You should expect the score to be perfect for any of the default SDV synthesizers.
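
The sample output labels the overall score as an average of the property scores. A minimal sketch of that arithmetic follows, with scores written as fractions and hypothetical helper names. It is an illustration only, not the report's own code.

```python
from statistics import mean

# Property scores as fractions, mirroring the sample output above (all 100%).
property_scores = {
    "Data Validity": 1.0,
    "Data Structure": 1.0,
    "Relationship Validity": 1.0,
}


def overall_score(scores):
    """Average the individual property scores into one overall score."""
    return mean(list(scores.values()))


def failing_properties(scores):
    """List the properties whose score is below a perfect 1.0."""
    return [name for name, score in scores.items() if score < 1.0]


print(f"Overall Score (Average): {overall_score(property_scores):.1%}")
print("Properties to investigate:", failing_properties(property_scores))
```

Because the overall score is an average, it can only reach 100% when every property scores 100%. Any lower result points to at least one property worth inspecting.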

What's Included?

The basic diagnostic checks are summarized in the table below.

Property | Description

Data Validity

Basic validity checks for each of the columns:

  1. Primary keys must always be unique and non-null

  2. Continuous values in the synthetic data must adhere to the min/max range in the real data

  3. Discrete values in the synthetic data must adhere to the same categories as the real data.

Relationship Validity

Basic validity checks for each relationship between a parent table and a child table:

  1. Each primary key in the parent table must have an appropriate number of children (i.e. cardinality) based on the min/max of the real data.

  2. Each foreign key in the child table must reference a primary key that exists in the parent (i.e. referential integrity).

Data Structure

Checks to ensure the real and synthetic data have the same column names.
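
To make the checks in the table concrete, here is a simplified sketch of each one written over plain Python lists. The function names and the scoring rule (the fraction of values that pass) are assumptions made for illustration. The actual SDMetrics implementations may differ, for example in how they treat missing values.

```python
from collections import Counter


def key_validity(keys):
    """Fraction of primary key values that are non-null and not duplicated."""
    counts = Counter(k for k in keys if k is not None)
    valid = sum(1 for k in keys if k is not None and counts[k] == 1)
    return valid / len(keys)


def boundary_adherence(real_values, synthetic_values):
    """Fraction of synthetic values inside the real data's min/max range."""
    low, high = min(real_values), max(real_values)
    return sum(1 for v in synthetic_values if low <= v <= high) / len(synthetic_values)


def category_adherence(real_values, synthetic_values):
    """Fraction of synthetic values whose category also appears in the real data."""
    allowed = set(real_values)
    return sum(1 for v in synthetic_values if v in allowed) / len(synthetic_values)


def referential_integrity(parent_keys, child_foreign_keys):
    """Fraction of child foreign keys that reference an existing parent key."""
    known = set(parent_keys)
    return sum(1 for fk in child_foreign_keys if fk in known) / len(child_foreign_keys)


def children_per_parent(parent_keys, child_foreign_keys):
    """Number of child rows for each parent key (0 when a parent has none)."""
    counts = Counter(child_foreign_keys)
    return [counts[pk] for pk in parent_keys]


def cardinality_adherence(real_parent, real_child, synth_parent, synth_child):
    """Fraction of synthetic parents whose child count lies in the real min/max."""
    real_counts = children_per_parent(real_parent, real_child)
    low, high = min(real_counts), max(real_counts)
    synth_counts = children_per_parent(synth_parent, synth_child)
    return sum(1 for c in synth_counts if low <= c <= high) / len(synth_counts)


def same_columns(real_columns, synthetic_columns):
    """True when both tables have exactly the same column names."""
    return set(real_columns) == set(synthetic_columns)


# Hypothetical values: one synthetic fee (45.0) exceeds the real maximum (40.0).
print(boundary_adherence([0.0, 10.5, 40.0], [5.0, 39.9, 45.0, 12.0]))
```

In this sketch, a score below 1.0 means that some synthetic values fall outside what the real data allows, which matches the kind of problem the report is meant to surface.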

get_details

This function returns details about the report's properties. Use it to pinpoint the exact columns or tables that are causing issues.

Parameters:

  • (required) property_name: A string with the name of the property. One of: 'Data Validity', 'Data Structure', or 'Relationship Validity'

  • table_name: A string with the name of the table. If provided, you'll receive filtered results for the table.

Returns: A pandas.DataFrame object with the detailed scores

diagnostic_report.get_details(property_name='Data Validity')
Table     Column           Metric               Score
guests    guest_email      KeyUniqueness        1.0
guests    had_rewards      CategoryAdherence    1.0
guests    room_type        CategoryAdherence    1.0
guests    amenities_fee    BoundaryAdherence    1.0
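
To narrow a details table down to the entries that need attention, one option is to filter on the Score column. The sketch below works on plain row dictionaries, which for a real report could be obtained with the standard pandas call details.to_dict('records'). The helper name and the sample rows are hypothetical, and one row deliberately carries an imperfect score.

```python
def failing_rows(rows, threshold=1.0):
    """Return the rows whose Score is below the threshold, lowest score first."""
    bad = [row for row in rows if row["Score"] < threshold]
    return sorted(bad, key=lambda row: row["Score"])


# Hypothetical detail rows shaped like the table above.
rows = [
    {"Table": "guests", "Column": "guest_email", "Metric": "KeyUniqueness", "Score": 1.0},
    {"Table": "guests", "Column": "room_type", "Metric": "CategoryAdherence", "Score": 1.0},
    {"Table": "guests", "Column": "amenities_fee", "Metric": "BoundaryAdherence", "Score": 0.97},
]

for row in failing_rows(rows):
    print(f"{row['Table']}.{row['Column']}: {row['Metric']} = {row['Score']:.2f}")
```

To limit the results to a single table at the source, pass the optional table_name parameter, for example diagnostic_report.get_details(property_name='Data Validity', table_name='guests').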

FAQs

See the SDMetrics DiagnosticReport for even more details about the metrics and properties included in the report.

What should I do if the score is not 100%?

All of the default SDV synthesizers should yield a score of 100%. If this is not the case, please contact us with more details about your project via GitHub or Slack.

Note that if you have changed any of the defaults (for example, if you have turned off min/max boundary enforcement), then the score may not be 100%.

How did you determine what the validity checks should be?

The items in this report answer the most basic data validity questions that we have heard from our users and customers. We've collected thousands of pieces of feedback to come up with this basic set.

If you have any questions or suggestions, please contact us via GitHub or Slack.

Older versions of the Diagnostic report contained other metrics. Can I still use them?

Yes! You can compute additional metrics using our standalone SDMetrics library.

If you're used to older versions of the SDV, you may be looking for NewRowSynthesis, CategoryCoverage, and RangeCoverage.

Copyright (c) 2023, DataCebo, Inc.