Public SDV Datasets

The SDGym library includes a variety of public demo datasets that you can use for benchmarking. These come from the overall SDV ecosystem.

These datasets are stored in a publicly readable S3 bucket created by DataCebo. For more information, see the Dataset Format guide.

By default, benchmarking includes 9 of the available datasets. These datasets were chosen as examples of rich data that you may find in real-world settings. They are of substantial size, contain a variety of columns, and meet the SDGym standards for single-table data.

| Dataset | Description |
| --- | --- |
| adult | Attributes corresponding to real adults in the 1994 US census |
| alarm | Simulated data for an alarm messaging system when monitoring patients |
| census | US census data extracted from 1994 and 1995 |
| child | Health properties corresponding to different patients |
| covtype | Information about forest covers in different regions of the world |
| expedia_hotel_logs | Web logs corresponding to a random selection of Expedia users browsing the website |
| insurance | Simulated data about various student drivers and their vehicles |
| intrusion | Network traffic that contains simulated attacks on a U.S. Air Force LAN |
| news | Attributes about published news articles |

Using the demo datasets in SDGym

SDGym is configured to use the demo datasets by default.

Exploring Datasets

The default DatasetExplorer reads the SDV demo datasets. For more information, see the Explore Datasets guide.

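from sdgym import DatasetExplorer

# With no arguments, the explorer reads the SDV demo datasets
explorer = DatasetExplorer()
# Summarize only the single-table datasets
summary = explorer.summarize_datasets(modality='single_table')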

Benchmarking

The benchmark functions run on the recommended demo datasets by default. To benchmark a different set, pass a list of dataset names to the sdv_datasets parameter. For more information, see the guide for Running a Benchmark (AWS).

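import sdgym

# Benchmark only the listed demo datasets instead of the full default set
results = sdgym.benchmark_single_table_aws(
    sdv_datasets=['adult', 'alarm', 'census', 'child', 'expedia_hotel_logs'],
    # Placeholder S3 location for the results; replace with your own bucket
    output_destination='s3://my_results_bucket/'
)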
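
If you build the sdv_datasets list programmatically, it may help to check the names before starting a run. The following is a minimal sketch, not part of SDGym: DEFAULT_DEMO_DATASETS, select_demo_datasets and run_benchmark are hypothetical names introduced here, and the check only covers the nine default datasets listed on this page, not every dataset in the public bucket.

```python
from typing import Callable, Iterable, List

# The nine default demo datasets listed on this page.
DEFAULT_DEMO_DATASETS = (
    'adult', 'alarm', 'census', 'child', 'covtype',
    'expedia_hotel_logs', 'insurance', 'intrusion', 'news',
)


def select_demo_datasets(requested: Iterable[str]) -> List[str]:
    """Return the requested names without repeats, or raise on unknown names.

    Hypothetical helper: it checks only against the nine defaults above.
    """
    selected = list(dict.fromkeys(requested))  # drop repeats, keep order
    unknown = [name for name in selected if name not in DEFAULT_DEMO_DATASETS]
    if unknown:
        raise ValueError(f'Not in the default demo datasets: {unknown}')
    return selected


def run_benchmark(benchmark_fn: Callable[..., object],
                  requested: Iterable[str],
                  output_destination: str) -> object:
    """Validate the names, then pass them on via the sdv_datasets parameter."""
    return benchmark_fn(
        sdv_datasets=select_demo_datasets(requested),
        output_destination=output_destination,
    )
```

Under these assumptions, a call could look like run_benchmark(sdgym.benchmark_single_table_aws, ['adult', 'news'], 's3://my_results_bucket/'). Passing the benchmark function as an argument keeps the name check usable without an SDGym installation or AWS access.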
