Local Runs
This page guides you through running the SDGym benchmark locally. To run on the cloud, please see the guide for AWS Runs.

To get started, run the benchmark and save your results to a local folder:
import sdgym
results_summary = sdgym.benchmark_single_table(
    output_destination='my_sdgym_results/',
)

See Interpreting Results for a description of the benchmarking results.
Optional Parameters
Every step of the benchmarking process is customizable. Use the optional parameters to control the setup, execution and evaluation.
Setup
Use these parameters to control which synthesizers and datasets to include in the benchmark.
synthesizers: Control which SDV synthesizers to use by supplying a list of strings with the synthesizer names.
(default)
['GaussianCopulaSynthesizer', 'CTGANSynthesizer', 'UniformSynthesizer']
Options include 'GaussianCopulaSynthesizer', 'CTGANSynthesizer', 'TVAESynthesizer', 'CopulaGANSynthesizer', and many more. See Synthesizers for more details.
sdgym.benchmark_single_table(synthesizers=['GaussianCopulaSynthesizer', 'TVAESynthesizer'])

Simulating graceful degradation. SDGym always runs the UniformSynthesizer as a backup synthesizer, even if it is not explicitly specified. This backup synthesizer is used to simulate graceful degradation in an enterprise setting. For more information, see Graceful Handling of Errors.
custom_synthesizers: Supply your own custom synthesizers and variants using a list of classes.
(default)
None: Do not run the benchmark on any custom synthesizers.
To create your own class, see Custom Synthesizers. You can also create a variant of an SDV Synthesizer.
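For instance, here is a minimal sketch of passing in a custom class, assuming you have already written MyCustomSynthesizer (a hypothetical name) as described in the Custom Synthesizers guide:

import sdgym
from my_module import MyCustomSynthesizer  # hypothetical custom synthesizer class

results = sdgym.benchmark_single_table(
    custom_synthesizers=[MyCustomSynthesizer]
)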
sdv_datasets: Control which of the SDV demo datasets to use by supplying their names as a list of strings.
(default)
['adult', 'alarm', 'census', 'child', 'expedia_hotel_logs', 'insurance', 'intrusion', 'news', 'covtype']
See Datasets for more options.
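For example, to benchmark against just two of the demo datasets:

import sdgym

# Run the benchmark on the 'adult' and 'census' demo datasets only
results = sdgym.benchmark_single_table(
    sdv_datasets=['adult', 'census']
)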
additional_datasets_folder: You can also supply the path to a local folder containing your own datasets. If your folder is on AWS (S3), see our guide on Running a Benchmark on AWS.
(default)
None: Do not run the benchmark for any additional datasets.
<string>: The path to your folder that contains additional datasets. Make sure your datasets are in the correct format and that you have the proper authentication to access the folder. See Custom Datasets for more details.
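For instance, a minimal sketch where 'my_datasets/' is a hypothetical local folder that holds datasets in the expected format:

import sdgym

# 'my_datasets/' is a hypothetical folder containing datasets in the
# format described in the Custom Datasets guide
results = sdgym.benchmark_single_table(
    additional_datasets_folder='my_datasets/'
)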
Execution
Use these parameters to control the speed and flow of the benchmarking.
limit_dataset_size: Set this boolean to limit the size of every dataset. This will yield faster results but may affect the overall quality.
(default)
False: Use the full datasets for benchmarking.
True: Limit the dataset size before benchmarking. For every dataset selected, use only 100 rows (randomly sampled) and the first 10 columns.
timeout: The maximum number of seconds to give each synthesizer to train on and sample a dataset.
(default)
None: Do not set a maximum. Allow the synthesizer to take as long as it needs.
<integer>: Allow a synthesizer to run for this number of seconds on each dataset. If the synthesizer exceeds the time, the benchmark will report a TimeoutError.
show_progress: Show the incremental progress of running the script.
(default)
False: Do not show the progress. Nothing will be printed to the screen.
True: Print a progress bar to indicate the completion of the benchmarking.
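For example:

import sdgym

# Print a progress bar while the benchmark runs
results = sdgym.benchmark_single_table(
    show_progress=True
)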
output_destination: Supply the name of the folder where you'd like to save the final results, as well as all the detailed artifacts created in the process.
(default)
None: Do not save any of the results.
<string>: Store the final results and any detailed artifacts created during the process. For more information about the folder, please see the guide on Detailed Results.
Evaluation
Use the evaluation parameters to control what to measure when benchmarking.
The SDGym benchmark will always measure performance (time and memory). Use additional parameters to evaluate other aspects of the synthetic data after it's created.
compute_diagnostic_score: Set this boolean to generate an overall diagnostic score for every synthesizer and dataset. This may increase the benchmarking time.
(default)
True: Compute an overall diagnostic score. See the SDMetrics Diagnostic Report for more details.
False: Do not compute a diagnostic score.
compute_quality_score: Set this boolean to generate an overall quality score for every synthesizer and dataset. This may increase the benchmarking time.
(default)
True: Compute an overall quality score. See the SDMetrics Quality Report for more details.
False: Do not compute a quality score.
compute_privacy_score: Set this boolean to generate an overall privacy score for every synthesizer and dataset. This may increase the benchmarking time.
(default)
True: Compute the privacy score. See the DCRBaselineProtection metric for more details.
False: Do not compute a privacy score.
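For example, to measure only performance (time and memory), turn off all three scores:

import sdgym

# Skip the diagnostic, quality and privacy scores to benchmark performance only
results = sdgym.benchmark_single_table(
    compute_diagnostic_score=False,
    compute_quality_score=False,
    compute_privacy_score=False
)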
sdmetrics: Provide a list of strings corresponding to metrics from the SDMetrics library.
(default)
None: Do not apply any additional metrics.
See the SDMetrics library for more metric options.
Examples
Running a quick trial to ensure the benchmark works:
import sdgym
results = sdgym.benchmark_single_table(
    limit_dataset_size=True,
    timeout=600,
    compute_quality_score=False,
    compute_privacy_score=False
)

Running a detailed benchmark with custom evaluation metrics:
import sdgym
results = sdgym.benchmark_single_table(
    output_destination='my_sdgym_results/',
    sdmetrics=[
        'MissingValueSimilarity',
        'RangeCoverage'
    ]
)