Public SDV Datasets

The SDGym library includes a variety of public demo datasets that you can use for benchmarking. These come from our main SDV library.

Available Datasets

Use the get_available_datasets function to view all the available datasets.

get_available_datasets

Lists all the publicly available demo datasets.

Parameters

modality: A string describing the type of data. At this time, the only supported modality is 'single_table'.

Returns: A pandas DataFrame object that describes each dataset's name, size and number of tables.

sdgym.get_available_datasets(modality='single_table')

dataset_name        size_MB        num_tables
KRK_v1              0.072128       1
adult               3.907448       1
alarm               4.520128       1
asia                1.280128       1
...

The reported dataset size is based on fully loading the data into Python. You may find slight deviations between the CSV file size and reported size.
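
As a small illustration of working with this table, the sketch below picks out the names of datasets under a size limit. It assumes the column names shown in the output above (dataset_name, size_MB). The helpers names_under_size and load_records are hypothetical names used for illustration, not part of SDGym, and the sample rows are copied from the output above so the snippet runs without a download.

```python
def names_under_size(records, max_size_mb):
    """Return the names of datasets whose reported size is at most max_size_mb."""
    return [
        row['dataset_name']
        for row in records
        if row['size_MB'] <= max_size_mb
    ]


def load_records():
    """Fetch the live listing as a list of dicts (hypothetical helper, needs sdgym)."""
    import sdgym  # imported lazily so names_under_size works without it

    datasets = sdgym.get_available_datasets(modality='single_table')
    return datasets.to_dict('records')


# Rows copied from the sample output above.
sample_records = [
    {'dataset_name': 'KRK_v1', 'size_MB': 0.072128, 'num_tables': 1},
    {'dataset_name': 'adult', 'size_MB': 3.907448, 'num_tables': 1},
    {'dataset_name': 'alarm', 'size_MB': 4.520128, 'num_tables': 1},
    {'dataset_name': 'asia', 'size_MB': 1.280128, 'num_tables': 1},
]

small = names_under_size(sample_records, max_size_mb=2.0)
print(small)  # ['KRK_v1', 'asia']
```

To run the same filter on the full listing, replace sample_records with load_records().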

Recommended Datasets

By default, the benchmark includes 9 of the available datasets. These datasets were chosen as examples of rich data that you may find in real-world settings. They are of substantial size, contain a variety of columns and meet the SDGym standards for single table data.

Dataset               Description
adult                 Attributes corresponding to real adults in the 1994 US census
alarm                 Simulated data for an alarm messaging system when monitoring patients
census                US census data extracted from 1994 and 1995
child                 Health properties corresponding to different patients
covtype               Information about forest covers in different regions of the world
expedia_hotel_logs    Web logs corresponding to a random selection of Expedia users browsing the website
insurance             Simulated data about various student drivers and their vehicles
intrusion             Network traffic that contains simulated attacks on a U.S. Air Force LAN
news                  Attributes about published news articles
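
If you want to run most of the defaults but leave a few out, one option is to spell the names out yourself. The sketch below copies the nine names from the table above into a list and drops the ones you exclude. RECOMMENDED_DATASETS and without are hypothetical names for illustration. They are not SDGym constants or functions.

```python
# The nine default dataset names, copied from the table above.
RECOMMENDED_DATASETS = [
    'adult',
    'alarm',
    'census',
    'child',
    'covtype',
    'expedia_hotel_logs',
    'insurance',
    'intrusion',
    'news',
]


def without(names, excluded):
    """Return names in their original order, skipping any listed in excluded."""
    excluded = set(excluded)
    return [name for name in names if name not in excluded]


selected = without(RECOMMENDED_DATASETS, ['census', 'covtype'])
print(selected)
```

The resulting list can then be passed to the sdv_datasets parameter of sdgym.benchmark_single_table.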

Benchmarking the datasets

You can benchmark any of the publicly available datasets by passing their names as strings to the sdv_datasets parameter.

import sdgym

sdgym.benchmark_single_table(
    sdv_datasets=['intrusion', 'KRK_v1']
)
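
Because datasets are referenced by name, a typo is easy to make. One possible safeguard, sketched below, is to compare the requested names against the listing from get_available_datasets before starting a run. The helpers split_known_unknown and benchmark_known are hypothetical, not part of SDGym, and the sketch assumes the listing has a dataset_name column as shown in the sample output earlier on this page.

```python
def split_known_unknown(requested, available):
    """Split requested names into those found in available and those not found."""
    available = set(available)
    known = [name for name in requested if name in available]
    unknown = [name for name in requested if name not in available]
    return known, unknown


def benchmark_known(requested):
    """Run the benchmark only if every requested name is available (needs sdgym)."""
    import sdgym  # imported lazily so split_known_unknown works without it

    listing = sdgym.get_available_datasets(modality='single_table')
    known, unknown = split_known_unknown(requested, listing['dataset_name'])
    if unknown:
        raise ValueError(f'Unknown dataset names: {unknown}')
    return sdgym.benchmark_single_table(sdv_datasets=known)


# Offline demonstration using names that appear on this page.
page_names = ['KRK_v1', 'adult', 'alarm', 'asia', 'intrusion']
known, unknown = split_known_unknown(['intrusion', 'KRK_v1', 'adlut'], page_names)
print(known)    # ['intrusion', 'KRK_v1']
print(unknown)  # ['adlut']
```

Calling benchmark_known(['intrusion', 'KRK_v1']) would then check the names against the live listing before running the same benchmark as above.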

Want to include your own datasets? See the Custom Datasets section for more information.
