Public SDV Datasets
The SDGym library includes a variety of public demo datasets that you can use for benchmarking. These come from our main SDV library.
Available Datasets
View all the available datasets using the get_available_datasets function.
get_available_datasets
Lists all the publicly available demo datasets.
Parameters
modality: A string describing the type of data. At this time, the only supported modality is 'single_table'.
Returns: A pandas DataFrame object that describes each dataset's name, size and number of tables.
The reported dataset size is based on fully loading the data into Python. You may find slight deviations between the CSV file size and reported size.
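As a minimal sketch of the call described above: the function name and the modality parameter come from this page, while the top-level `sdgym` import path and the `list_demo_datasets` wrapper are assumptions made for illustration.

```python
def list_demo_datasets(modality='single_table'):
    """Hypothetical wrapper that returns the table of public demo datasets."""
    # Assumption: get_available_datasets is importable from the top-level
    # sdgym package. The import is deferred so that this snippet can be
    # loaded even when SDGym is not installed.
    from sdgym import get_available_datasets

    # 'single_table' is the only modality this page lists as supported.
    return get_available_datasets(modality=modality)


if __name__ == '__main__':
    datasets = list_demo_datasets()
    # The result is a pandas DataFrame, so the usual inspection tools apply.
    print(datasets.head())
```

Running the script requires SDGym to be installed and may need network access to look up the datasets.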
Recommended Datasets
By default, the benchmark includes 9 of the available datasets. These datasets were chosen as examples of rich data that you may find in real-world settings. They are of substantial size, contain a variety of columns and meet the SDGym standards for single table data.
adult
alarm
census
child: Health properties corresponding to different patients
covtype
expedia_hotel_logs
insurance
intrusion
news: Attributes about published news articles
Benchmarking the datasets
You can benchmark any of the publicly available datasets by passing their names as strings to the sdv_datasets parameter.
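The following sketch shows one way this might look in practice. The sdv_datasets parameter and the dataset names come from this page. The use of `sdgym.benchmark_single_table` as the entry point is an assumption, and `benchmark_public_datasets` is a hypothetical helper, not part of the library.

```python
# The nine recommended dataset names listed on this page.
RECOMMENDED_DATASETS = [
    'adult', 'alarm', 'census', 'child', 'covtype',
    'expedia_hotel_logs', 'insurance', 'intrusion', 'news',
]


def benchmark_public_datasets(dataset_names):
    """Hypothetical helper: benchmark the named public demo datasets."""
    names = list(dataset_names)
    if not names:
        raise ValueError('Provide at least one dataset name.')

    # Assumption: benchmark_single_table is the benchmarking entry point
    # that accepts sdv_datasets. Imported lazily so the snippet loads
    # without SDGym installed.
    import sdgym

    return sdgym.benchmark_single_table(sdv_datasets=names)


if __name__ == '__main__':
    # Benchmark two of the recommended datasets. All other settings are
    # left at the library's defaults, so this may take a while to run.
    results = benchmark_public_datasets(['adult', 'alarm'])
    print(results)
```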
Want to include your own datasets? See the Custom Datasets section for more information.