Synthetic Data Vault

Copyright (c) 2023, DataCebo, Inc.

Diagnostic

The Diagnostic Report runs some basic checks for data format and validity. Run this to ensure that you have created valid synthetic data.

Usage

Run the diagnostic to receive a score and a corresponding report.

run_diagnostic

Use this function to run a diagnostic on the synthetic data.

from sdv.evaluation.single_table import run_diagnostic

diagnostic_report = run_diagnostic(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Generating report ...

(1/2) Evaluating Data Validity: |██████████| 9/9 [00:00<00:00, 458.92it/s]|
Data Validity Score: 100.0%

(2/2) Evaluating Data Structure: |██████████| 1/1 [00:00<00:00, 104.60it/s]|
Data Structure Score: 100.0%

Overall Score (Average): 100.0%

Parameters:

  • (required) real_data: A pandas.DataFrame containing the real data

  • (required) synthetic_data: A pandas.DataFrame containing the synthetic data

  • verbose: A boolean describing whether or not to print the report progress and results. Defaults to True. Set this to False to run the report silently.

Interpreting the Score

The score should be 100%. The diagnostic report checks for basic data validity and data structure issues. You should expect the score to be perfect for any of the default SDV synthesizers.
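
One way to act on this expectation is to fail fast whenever the score drops below 100%. The sketch below assumes the report object exposes a get_score() method that returns a value between 0 and 1, as SDMetrics reports do. The helper name require_perfect_score and its tolerance parameter are hypothetical and introduced here only for illustration.

```python
def require_perfect_score(report, tolerance=1e-9):
    """Return the report's overall score, raising if it is below 1.0.

    `report` is assumed to be the object returned by run_diagnostic, e.g.
    run_diagnostic(real_data, synthetic_data, metadata, verbose=False).
    """
    score = report.get_score()  # assumed: overall score in the range [0, 1]
    if score < 1.0 - tolerance:
        raise ValueError(f"Diagnostic score is {score:.1%}, expected 100.0%")
    return score
```

Passing verbose=False to run_diagnostic keeps the progress output quiet, which may be convenient when a guard like this runs inside an automated pipeline.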

What's Included?

The basic diagnostic checks are summarized in the table below.

Property        Description

Data Validity   Basic validity checks for each of the columns:

                  1. Primary keys must always be unique and non-null

                  2. Continuous values in the synthetic data must adhere to the min/max range in the real data

                  3. Discrete values in the synthetic data must adhere to the same categories as the real data

Data Structure  Checks to ensure the real and synthetic data have the same column names
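
To make these checks concrete, here is a simplified sketch of how each one could be scored as a fraction between 0 and 1. These helpers are illustrative only and are not the SDMetrics implementation. The function names are assumptions, missing values are represented as None, and an empty column is assumed to score 1.0. Each function accepts any iterable of values, such as a plain list.

```python
def _ratio(hits, total):
    # Assumption: an empty column has nothing to violate, so it scores 1.0.
    return hits / total if total else 1.0


def key_uniqueness(synthetic_keys):
    """Fraction of rows whose key is non-null and has not appeared before."""
    keys = list(synthetic_keys)
    seen = set()
    valid = 0
    for key in keys:
        if key is not None and key not in seen:
            valid += 1
        seen.add(key)
    return _ratio(valid, len(keys))


def boundary_adherence(real_values, synthetic_values):
    """Fraction of synthetic values inside the real data's min/max range."""
    real = [v for v in real_values if v is not None]
    synthetic = [v for v in synthetic_values if v is not None]
    low, high = min(real), max(real)
    return _ratio(sum(low <= v <= high for v in synthetic), len(synthetic))


def category_adherence(real_values, synthetic_values):
    """Fraction of synthetic values that also occur in the real data."""
    allowed = {v for v in real_values if v is not None}
    synthetic = [v for v in synthetic_values if v is not None]
    return _ratio(sum(v in allowed for v in synthetic), len(synthetic))


def same_column_names(real_columns, synthetic_columns):
    """1.0 if both tables have the same set of column names, else 0.0."""
    return 1.0 if set(real_columns) == set(synthetic_columns) else 0.0
```

For example, if the real amenities_fee values span 0.0 to 40.0 and one of four synthetic values is 41.0, boundary_adherence returns 0.75 for that column.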

get_details

This function returns details about the report's properties. Use it to pinpoint the exact columns or tables that are causing issues.

Parameters:

  • (required) property_name: A string with the name of the property. One of: 'Data Validity' or 'Data Structure'.

Returns: A pandas.DataFrame object with the detailed scores

diagnostic_report.get_details(property_name='Data Validity')
Column          Metric               Score
guest_email     KeyUniqueness        1.0
had_rewards     CategoryAdherence    1.0
room_type       CategoryAdherence    1.0
amenities_fee   BoundaryAdherence    1.0
...
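
Because every row of the details table carries a column name, a metric, and a score, the columns that need attention can be listed by filtering on the score. The sketch below works on a list of row dictionaries, which is what to_dict('records') returns for a pandas.DataFrame. The helper failing_columns is hypothetical, the dictionary keys are assumed to match the headers in the output above, and the sample scores are made up for illustration.

```python
def failing_columns(rows, threshold=1.0):
    """Return (column, metric, score) for every row scoring below threshold."""
    return [
        (row["Column"], row["Metric"], row["Score"])
        for row in rows
        if row["Score"] < threshold
    ]


# Rows shaped like the get_details output; these scores are invented.
rows = [
    {"Column": "guest_email", "Metric": "KeyUniqueness", "Score": 1.0},
    {"Column": "room_type", "Metric": "CategoryAdherence", "Score": 0.98},
]
print(failing_columns(rows))  # [('room_type', 'CategoryAdherence', 0.98)]
```

With a real report, the rows could come from diagnostic_report.get_details(property_name='Data Validity').to_dict('records').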

FAQs

What should I do if the score is not 100%?

All of the default SDV synthesizers should yield a score of 100%. If this is not the case, please contact us with more details about your project via GitHub or Slack.

Note that if you have changed any of the defaults (for example, if you have turned off min/max boundary enforcement), then the score may not be 100%.

How did you determine what the validity checks should be?

The items in this report answer the most basic data validity questions that we have heard from our users and customers. We've collected thousands of pieces of feedback to come up with this basic set.

If you have any questions or suggestions, please contact us via GitHub or Slack.

Older versions of the Diagnostic report contained other metrics. Can I still use them?

Yes! You can compute additional metrics using our standalone SDMetrics library.

If you're used to older versions of the SDV, you may be looking for NewRowSynthesis, CategoryCoverage, and RangeCoverage.

In addition to the parameters listed under run_diagnostic, the function takes a required metadata parameter: a Metadata object with your metadata. It returns an SDMetrics DiagnosticReport object generated with your real and synthetic data.

See the SDMetrics DiagnosticReport for even more details about the metrics and properties included in the report.

Last updated 7 months ago