Synthetic Data Vault

Copyright (c) 2023, DataCebo, Inc.


* Performance Estimates


Last updated 6 months ago

How well will SDV synthesizers be able to model your full data schema? Use this feature to get some estimates with only your metadata.

* create_and_test_multi_table

Simulate the performance of different multi-table synthesizers using your metadata.

This function uses the DayZSynthesizer to create random data from your metadata. It then runs the random data through the multi-table synthesizers you select, and through the evaluation reports, to estimate how each one performs.

from sdv.utils.multi_table import create_and_test_multi_table

create_and_test_multi_table(
  metadata=my_metadata,
  synthesizers=['HMASynthesizer', 'HSASynthesizer'],
  output_folder='my_performance_results/',
  default_num_rows=1_000_000,
  timeout=3600 # 1 hour per synthesizer
)

Parameters:

  • (required) metadata: A Metadata object

  • (required) synthesizers: A list of strings representing the multi-table synthesizers that you want to test. Options are: 'HMASynthesizer', 'HSASynthesizer' or 'IndependentSynthesizer'

  • (required) output_folder: A destination folder where the random data, results, and other artifacts will be saved

  • default_num_rows: An integer with the number of rows to create by default for all tables

    • (default) 1000: Create 1000 rows for every table

  • num_rows_per_table: A dictionary that maps a table name to the number of rows to create for that table only. Values here override default_num_rows for the tables listed.

    • (default) None: Do not override the default number of rows for any individual table

  • timeout: The maximum number of seconds to give to each synthesizer to train and sample the dataset

    • (default) None: Do not set a maximum. Allow the synthesizer to take as long as it needs.

    • <integer>: Allow each synthesizer to run for at most this many seconds. If a synthesizer exceeds the time limit, the output will include a TimeoutError.
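
The interaction between default_num_rows and num_rows_per_table can be pictured with a small sketch. This only illustrates the override rule described above; it is not SDV's implementation. The helper name resolve_num_rows is hypothetical, and the table names are borrowed from the output folder example on this page.

```python
# Hypothetical table names, borrowed from the output folder example on this page.
table_names = ['users', 'transactions']


def resolve_num_rows(table_names, default_num_rows=1000, num_rows_per_table=None):
    """Return the number of rows to create for each table.

    Illustrates the documented override rule only; this is not SDV's code.
    """
    overrides = num_rows_per_table or {}
    # A table listed in the overrides uses its own value; all others use the default.
    return {name: overrides.get(name, default_num_rows) for name in table_names}


row_counts = resolve_num_rows(
    table_names,
    default_num_rows=1_000,
    num_rows_per_table={'transactions': 50_000},
)
print(row_counts)  # {'users': 1000, 'transactions': 50000}
```

Here only the transactions table is overridden, so users falls back to the default.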

Output: A pandas DataFrame with detailed performance results from each synthesizer.

Interpreting the results

Your results include detailed timings for training, sampling, and evaluations.

synthesizer        init_time    preprocess_time    fit_processed_time    sample_time    diagnostic_time    diagnostic_score    quality_time
DayZSynthesizer    0.0009       None               None                  1.23           None               None                None
HMASynthesizer     0.00098      12.34              456.789               234.567        1.23               1.0                 234.12
HSASynthesizer     0.0008       12.45              34.566                23.456         1.25               1.0                 239.45

Description of each column:

  • synthesizer: The name of the synthesizer. The first row contains the DayZSynthesizer, which is used for creating the random data. Any subsequent rows include the results for the different multi-table synthesizers you are testing.

  • init_time: The time it takes to initialize the synthesizer

  • preprocess_time: The time it takes to preprocess the data into a ready state for modeling, in seconds.

  • fit_processed_time: The time it takes to train a model using the processed data, in seconds.

  • sample_time: The time it takes to generate synthetic data from the trained model, in seconds. This step generates synthetic data that is the same size as the input data.
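
One way to compare synthesizers is to add up the per-step timings. The sketch below assumes the results look like the example table on this page (the values are copied from it) and sums the three columns documented as being in seconds. The column name total_modeling_time is illustrative, not something SDV produces.

```python
import pandas as pd

# Values copied from the example results table on this page.
results = pd.DataFrame({
    'synthesizer': ['DayZSynthesizer', 'HMASynthesizer', 'HSASynthesizer'],
    'preprocess_time': [None, 12.34, 12.45],
    'fit_processed_time': [None, 456.789, 34.566],
    'sample_time': [1.23, 234.567, 23.456],
})

# Skip the DayZSynthesizer row: it only creates the random input data.
tested = results[results['synthesizer'] != 'DayZSynthesizer'].copy()

# Total modeling time = preprocess + train + sample, in seconds.
time_columns = ['preprocess_time', 'fit_processed_time', 'sample_time']
tested['total_modeling_time'] = tested[time_columns].sum(axis=1)

# Pick the synthesizer with the smallest total.
fastest = tested.loc[tested['total_modeling_time'].idxmin(), 'synthesizer']
print(tested[['synthesizer', 'total_modeling_time']])
print('Fastest:', fastest)
```

With the example numbers, HSASynthesizer has the smaller total. Timings on random data are estimates, so treat the comparison as a rough guide.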

Output folder

Your output folder contains the final results in results.csv, the random DayZ data, and the diagnostic report for each synthesizer.

my_performance_results/
|--- results.csv
|--- DayZ-Data/
       |--- users.csv
       |--- transactions.csv
|--- Diagnostic-Reports/
       |--- hsa_diagnostic.pkl
       |--- independent_diagnostic.pkl
...

FAQ

What kinds of results are expected for the DayZSynthesizer?

The DayZSynthesizer is a special synthesizer used for bootstrapping. It does not use machine learning and is only capable of creating data from scratch. Therefore, you should only see results for init_time (for initializing the synthesizer) and sample_time (for sampling random data).

It does not make sense to run diagnostics or measure the quality of random data, so these columns will also remain blank.

Why do the results contain the diagnostic score but not the quality score?

All SDV synthesizers are meant to guarantee a diagnostic score of 1.0 regardless of the data that they are trained on. So even for random data, it is good to check that this score is 1.0. If it's not, please file a bug so our team can take a look.

Meanwhile, the quality score measures the statistical similarity between the input data and the synthetic data. It's important to know how long this will take, but the actual score is not meaningful because the function's input is random data.

The related results columns are:

  • diagnostic_time: The time it takes to run the DiagnosticReport for the synthetic data, in seconds.

  • diagnostic_score: The final score of the DiagnosticReport, on a scale from 0 (worst) to 1 (best). We expect all SDV synthesizers to produce synthetic data with a score of 1.0.

  • quality_time: The time it takes to run the QualityReport, in seconds.
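
The diagnostic score check described in this answer can be sketched in a few lines. The helper find_unexpected_diagnostic_scores is hypothetical, and it assumes each score is either numeric or blank. The rows mirror the example results table; with a real results DataFrame you could pass results.to_dict('records') instead.

```python
import math


def find_unexpected_diagnostic_scores(rows):
    """Return the names of synthesizers whose diagnostic_score is not 1.0."""
    flagged = []
    for row in rows:
        score = row.get('diagnostic_score')
        # Skip blank scores (None or NaN); these are expected for DayZSynthesizer.
        if score is None or (isinstance(score, float) and math.isnan(score)):
            continue
        if not math.isclose(score, 1.0):
            flagged.append(row['synthesizer'])
    return flagged


# Rows mirroring the example results table on this page.
rows = [
    {'synthesizer': 'DayZSynthesizer', 'diagnostic_score': None},
    {'synthesizer': 'HMASynthesizer', 'diagnostic_score': 1.0},
    {'synthesizer': 'HSASynthesizer', 'diagnostic_score': 1.0},
]
print(find_unexpected_diagnostic_scores(rows))  # []
```

An empty list means every tested synthesizer reached the expected score. Any name in the list is worth reporting.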


*SDV Enterprise Feature. This feature is only available for licensed, enterprise users. For more information, visit our page to Explore SDV.