Public SDV Datasets
The SDGym library includes a variety of public demo datasets that you can use for benchmarking. These come from the overall SDV ecosystem.
These datasets are stored in a publicly readable S3 bucket created by DataCebo. For more information, see the Dataset Format guide.
Recommended Datasets
By default, benchmarking includes 9 of the available datasets. These datasets were chosen as examples of rich data that you may find in real-world settings. They are of substantial size, contain a variety of columns, and meet the SDGym standards for single table data.
adult
Attributes corresponding to real adults in the 1994 US census
alarm
Simulated data for an alarm messaging system when monitoring patients
census
US census data extracted from 1994 and 1995
child
Health properties corresponding to different patients
covtype
Information about forest covers in different regions of the world
expedia_hotel_logs
Web logs corresponding to a random selection of Expedia users browsing the website
insurance
Simulated data about various student drivers and their vehicles
intrusion
Network traffic that contains simulated attacks on a U.S. Air Force LAN
news
Attributes about published news articles
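If you plan to benchmark on a subset of the recommended datasets, it can help to check the names before starting a long run. The following is a minimal sketch, not part of SDGym: the constant RECOMMENDED_DATASETS simply copies the nine names listed above, and pick_datasets is a hypothetical helper introduced here for illustration. It deliberately rejects any name outside the recommended list, even though other datasets may be available.

```python
# The nine recommended dataset names, copied from the list above.
RECOMMENDED_DATASETS = [
    'adult', 'alarm', 'census', 'child', 'covtype',
    'expedia_hotel_logs', 'insurance', 'intrusion', 'news',
]


def pick_datasets(requested):
    """Hypothetical helper (not part of SDGym).

    Returns the requested names in their original order and raises
    ValueError if any name is not in the recommended list.
    """
    unknown = [name for name in requested if name not in RECOMMENDED_DATASETS]
    if unknown:
        raise ValueError(f'Not in the recommended list: {unknown}')
    return list(requested)


# Example: a small subset to use as a dataset selection.
subset = pick_datasets(['adult', 'census', 'news'])
print(subset)
```

The resulting list has the same shape as the value passed to the sdv_datasets parameter described later on this page.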
Using the demo datasets in SDGym
SDGym is configured to use the demo datasets by default.
Exploring Datasets
The default DatasetExplorer reads the SDV demo datasets. For more information, see the Explore Datasets guide.
from sdgym import DatasetExplorer
explorer = DatasetExplorer()
summary = explorer.summarize_datasets(modality='single_table')
Benchmarking
The benchmark functions run on the recommended demo datasets by default. You can change this selection using the sdv_datasets parameter. For more information, see the guide for Running a Benchmark (AWS).
import sdgym
results = sdgym.benchmark_single_table_aws(
    sdv_datasets=['adult', 'alarm', 'census', 'child', 'expedia_hotel_logs'],
    output_destination='s3://my_results_bucket/'
)
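Before launching a benchmark, you may want to sanity-check the arguments locally. The sketch below is an illustration only: build_benchmark_kwargs is a hypothetical helper, not an SDGym function, and the check that output_destination is an s3:// URI with a bucket name is an assumption based on the example above rather than a documented requirement.

```python
from urllib.parse import urlparse


def build_benchmark_kwargs(datasets, output_destination):
    """Hypothetical helper (not part of SDGym).

    Assumes the destination should look like 's3://<bucket>/...', as in
    the example above, and that at least one dataset name is given.
    """
    if not datasets:
        raise ValueError('Provide at least one dataset name.')
    parsed = urlparse(output_destination)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Expected an s3:// URI, got: {output_destination!r}')
    return {
        'sdv_datasets': list(datasets),
        'output_destination': output_destination,
    }


kwargs = build_benchmark_kwargs(['adult', 'alarm'], 's3://my_results_bucket/')
print(kwargs)

# With SDGym installed and AWS access configured, the arguments could
# then be unpacked into the call shown above:
# results = sdgym.benchmark_single_table_aws(**kwargs)
```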