Public SDV Datasets

The SDGym library includes a variety of public demo datasets that you can use for benchmarking. These come from the overall SDV ecosystem.

These datasets are stored in a publicly readable S3 bucket created by DataCebo. For more information, see the Dataset Format guide.

Using the demo datasets in SDGym

SDGym is configured to use the demo datasets by default.

Exploring Datasets

The default DatasetExplorer reads the SDV demo datasets. For more information, see the Explore Datasets guide.

from sdgym import DatasetExplorer

explorer = DatasetExplorer()
summary = explorer.summarize_datasets(modality='single_table')

Benchmarking

The benchmark functions run on the recommended demo datasets by default. To benchmark a different set, pass the dataset names to the sdv_datasets parameter. For more information, see the guide for Running a Benchmark (AWS).

import sdgym

results = sdgym.benchmark_single_table_aws(
    sdv_datasets=['adult', 'alarm', 'census', 'child', 'expedia_hotel_logs'],
    output_destination='s3://my_results_bucket/'
)
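
Because a benchmark run on AWS can take a long time, it may help to check the arguments before starting it. The following is a minimal sketch, not part of SDGym: build_benchmark_kwargs is a hypothetical helper that removes duplicate dataset names and checks that the output destination looks like an S3 URI. It does not verify that the dataset names exist among the demo datasets or that the bucket is writable.

```python
from urllib.parse import urlparse


def build_benchmark_kwargs(dataset_names, output_destination):
    """Hypothetical helper (not an SDGym function): validate benchmark arguments."""
    # Drop duplicate names while keeping the order the caller supplied.
    unique_names = list(dict.fromkeys(dataset_names))
    if not unique_names:
        raise ValueError('Provide at least one dataset name.')

    # Only the shape of the URI is checked here, not access to the bucket.
    parsed = urlparse(output_destination)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(
            f"Expected an 's3://bucket/...' URI, got {output_destination!r}"
        )

    return {
        'sdv_datasets': unique_names,
        'output_destination': output_destination,
    }


kwargs = build_benchmark_kwargs(
    ['adult', 'alarm', 'adult', 'census'],
    's3://my_results_bucket/',
)
print(kwargs['sdv_datasets'])
```

The resulting dictionary could then be unpacked into the call shown above, for example sdgym.benchmark_single_table_aws(**kwargs), assuming AWS access is set up as described in the Running a Benchmark (AWS) guide.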
