＊ Performance Estimates

＊SDV Enterprise Feature. This feature is only available for licensed, enterprise users. For more information, visit our page to Compare SDV Features.

How well will SDV synthesizers be able to model your full data schema? Use this feature to estimate their performance using only your metadata, before you have any real data to train on.

＊ create_and_test_multi_table

Simulate the performance of different multi-table synthesizers using your metadata.

This function uses the DayZSynthesizer to create random data. It then runs the random data through each multi-table synthesizer you select, timing the synthesizer itself as well as the evaluation reports that are run on its output.

from sdv.utils.multi_table import create_and_test_multi_table

create_and_test_multi_table(
  metadata=my_metadata,
  synthesizers=['HMASynthesizer', 'HSASynthesizer'],
  output_folder='my_performance_results/',
  default_num_rows=1_000_000,
  timeout=3600 # 1 hour per synthesizer
)

Parameters:

(required) metadata: A Metadata object
(required) synthesizers: A list of strings representing the multi-table synthesizers that you want to test. Options are: 'HMASynthesizer', 'HSASynthesizer' or 'IndependentSynthesizer'
(required) output_folder: A destination folder where the random data, results, and other artifacts will be saved
default_num_rows: An integer with the number of rows to create by default for all tables
- (default) 1000: Create 1000 rows for every table
num_rows_per_table: A dictionary that maps a table name to the number of rows to create for that table only. Values here override default_num_rows for the tables listed
- (default) None: Do not override the default number of rows for any individual table
timeout: The maximum number of seconds each synthesizer is given to train on and sample the dataset
- (default) None: Do not set a maximum. Allow the synthesizer to take as long as it needs.
- <integer>: Allow each synthesizer to run for at most this many seconds. If a synthesizer exceeds the limit, the output will include a TimeoutError.
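
The way default_num_rows and num_rows_per_table combine can be pictured with a small sketch. The helper below, resolve_num_rows, is hypothetical and not part of SDV. It only mirrors the override rule described in the parameter list, and the table names are made up for illustration.

```python
def resolve_num_rows(table_names, default_num_rows=1000, num_rows_per_table=None):
    """Return the row count each table would get under the documented override rule.

    Hypothetical helper for illustration only; SDV's internals may differ.
    """
    overrides = num_rows_per_table or {}
    return {name: overrides.get(name, default_num_rows) for name in table_names}


# Made-up table names: 'transactions' is overridden, 'users' falls back to the default.
planned_rows = resolve_num_rows(
    table_names=['users', 'transactions'],
    default_num_rows=1_000,
    num_rows_per_table={'transactions': 50_000},
)
print(planned_rows)
```

Listing only the tables that need a different size keeps the dictionary short; every other table uses the default.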

Output: A pandas DataFrame with detailed performance results from each synthesizer

Interpreting the results

Your results include detailed timings for training, sampling, and evaluations.

synthesizer        init_time    preprocess_time    fit_processed_time    sample_time    diagnostic_time    diagnostic_score    quality_time
DayZSynthesizer    0.0009       None               None                  1.23           None               None                None
HMASynthesizer     0.00098      12.34              456.789               234.567        1.23               1.0                 234.12
HSASynthesizer     0.0008       12.45              34.566                23.456         1.25               1.0                 239.45

Description of each column:

synthesizer: The name of the synthesizer. The first row contains the DayZSynthesizer, which is used for creating the random data. Any subsequent rows include the results for the different multi-table synthesizers you are testing.
init_time: The time it takes to initialize the synthesizer, in seconds.
preprocess_time: The time it takes to preprocess the data so that it is ready for modeling, in seconds.
fit_processed_time: The time it takes to train a model using the processed data, in seconds.
sample_time: The time it takes to generate synthetic data from the trained model, in seconds. This step generates synthetic data that is the same size as the input data.
diagnostic_time: The time it takes to run the DiagnosticReport for the synthetic data, in seconds.
diagnostic_score: The final score of the DiagnosticReport, on a scale from 0 (worst) to 1 (best). We expect that all SDV synthesizers produce synthetic data with a score of 1.0
quality_time: The time it takes to run the QualityReport, in seconds.
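
Because every timing column is reported in seconds, the columns can be added up to get a rough end-to-end figure for one synthesizer. The sketch below assumes each results row is available as a plain dictionary with None for blank cells; total_seconds is a hypothetical helper, not an SDV function, and the numbers are copied from the example table.

```python
TIME_COLUMNS = [
    'init_time', 'preprocess_time', 'fit_processed_time',
    'sample_time', 'diagnostic_time', 'quality_time',
]


def total_seconds(row):
    """Sum every timing column in one results row, skipping blank (None) cells."""
    return sum(row[col] for col in TIME_COLUMNS if row.get(col) is not None)


# Values copied from the HSASynthesizer row of the example table.
hsa_row = {
    'synthesizer': 'HSASynthesizer',
    'init_time': 0.0008,
    'preprocess_time': 12.45,
    'fit_processed_time': 34.566,
    'sample_time': 23.456,
    'diagnostic_time': 1.25,
    'diagnostic_score': 1.0,
    'quality_time': 239.45,
}
print(f"{hsa_row['synthesizer']}: {total_seconds(hsa_row):.1f} seconds end to end")
```

The diagnostic_score column is left out of the sum on purpose, since it is a score rather than a duration.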

Output folder

Your output folder contains the final results in results.csv, the random DayZ data, and the diagnostic report for each synthesizer.

my_performance_results/
|--- results.csv
|--- DayZ-Data/
       |--- users.csv
       |--- transactions.csv
|--- Diagnostic-Reports/
       |--- hsa_diagnostic.pkl
       |--- independent_diagnostic.pkl
...

FAQ

What kinds of results are expected for the DayZSynthesizer?

The DayZSynthesizer is a special synthesizer used for bootstrapping. It does not use machine learning and is only capable of creating data from scratch. Therefore, you should only see results for init_time (for initializing the synthesizer) and sample_time (for sampling random data).

It does not make sense to run diagnostics or measure the quality of random data, so these columns will also remain blank.

Why do the results contain the diagnostic score but not the quality score?

All SDV synthesizers are meant to guarantee a diagnostic score of 1.0 regardless of the data that they are trained on. So even for random data, it is good to check that this score is 1.0. If it's not, please file a bug so our team can take a look.

Meanwhile, the quality score measures the statistical similarity between the input data and the synthetic data. It's important to know how long this report will take to run, but the score itself is not meaningful because the input data is random.
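
The diagnostic check described in this answer can be automated. The sketch below assumes the results are available as a list of dictionaries, one per row, with None for blank cells; find_diagnostic_failures is a hypothetical helper, not part of SDV, and the sample rows reuse the scores from the example table.

```python
def find_diagnostic_failures(rows, generator_name='DayZSynthesizer'):
    """Return the names of tested synthesizers whose diagnostic_score is not 1.0.

    The row for the random-data generator is skipped because its
    diagnostic columns are expected to be blank.
    """
    failures = []
    for row in rows:
        if row['synthesizer'] == generator_name:
            continue
        if row.get('diagnostic_score') != 1.0:
            failures.append(row['synthesizer'])
    return failures


# Scores copied from the example results table.
example_rows = [
    {'synthesizer': 'DayZSynthesizer', 'diagnostic_score': None},
    {'synthesizer': 'HMASynthesizer', 'diagnostic_score': 1.0},
    {'synthesizer': 'HSASynthesizer', 'diagnostic_score': 1.0},
]
failures = find_diagnostic_failures(example_rows)
print(failures or 'All tested synthesizers scored 1.0')
```

A synthesizer with a blank score is also reported by this sketch, so a run that did not finish would show up alongside any genuine diagnostic failure.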

