Custom Datasets
This guide explains how to include your own custom datasets in the SDGym benchmarking framework.
Custom Dataset Requirements
You can add any number of custom datasets that represent single table data.
Dataset Format
Each dataset must have:
Data, stored as a single CSV file with a name ending in .csv
Metadata, stored as a JSON file named metadata.json
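The two requirements above can be checked before running a benchmark. The following is a minimal sketch using only the Python standard library; the helper name check_dataset_dir is hypothetical and is not part of SDGym. It only checks that the files exist and does not validate the contents of the CSV or the metadata.

```python
from pathlib import Path


def check_dataset_dir(dataset_dir):
    """Return a list of problems found in a dataset directory (empty if none).

    Hypothetical helper, not part of SDGym: it only checks file names.
    """
    dataset_dir = Path(dataset_dir)
    problems = []

    # Requirement 1: exactly one data file whose name ends in .csv
    csv_files = list(dataset_dir.glob("*.csv"))
    if len(csv_files) != 1:
        problems.append(f"expected exactly 1 CSV file, found {len(csv_files)}")

    # Requirement 2: a metadata file named metadata.json
    if not (dataset_dir / "metadata.json").is_file():
        problems.append("missing metadata.json")

    return problems
```

An empty list means the directory has the expected files and is ready to be packaged.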
File Structure
SDGym is optimized for running the benchmark on multiple custom datasets at once. Please convert all of your custom datasets into the following file structure:
Compress the CSV and JSON file for each dataset into a single zip file
Put all the zip files into a single folder for all your custom datasets
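The overall structure is a single folder that contains one zip file per dataset, with each zip file holding that dataset's CSV file and its metadata.json file.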
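The packaging steps can be scripted. The sketch below uses only the Python standard library and assumes that each dataset starts out in its own sub-folder of a source folder; the function name package_datasets, the one-sub-folder-per-dataset layout, and the choice to store both files at the top level of each archive are assumptions made for illustration, not requirements stated by SDGym.

```python
import zipfile
from pathlib import Path


def package_datasets(source_root, output_folder):
    """Zip each dataset sub-folder of source_root into output_folder/<name>.zip.

    Hypothetical helper, not part of SDGym.
    """
    source_root, output_folder = Path(source_root), Path(output_folder)
    output_folder.mkdir(parents=True, exist_ok=True)
    created = []

    for dataset_dir in sorted(p for p in source_root.iterdir() if p.is_dir()):
        csv_files = list(dataset_dir.glob("*.csv"))
        metadata = dataset_dir / "metadata.json"
        if len(csv_files) != 1 or not metadata.is_file():
            continue  # skip folders that do not match the required format

        zip_path = output_folder / f"{dataset_dir.name}.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
            # Assumption: both files sit at the top level of the archive.
            archive.write(csv_files[0], arcname=csv_files[0].name)
            archive.write(metadata, arcname="metadata.json")
        created.append(zip_path)

    return created
```

Each zip file is named after its source sub-folder, so the sub-folder names determine how the datasets are labeled later.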
Using your custom datasets
To use your custom datasets, supply the path to the overall folder using the additional_datasets_folder parameter.
Your folder can be stored locally on your computer, or it can be an Amazon S3 bucket.
Local Path
If you have the datasets folder stored on your machine, provide the folder's path as a string.
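As a rough sketch of the local case, the snippet below checks that the folder exists and contains zip files, then passes its path as a string. It assumes the benchmark is started with sdgym.benchmark_single_table and that all other arguments are left at their defaults; the helper names list_dataset_zips and run_benchmark are hypothetical.

```python
from pathlib import Path


def list_dataset_zips(folder):
    """Return the zip files in a local datasets folder, sorted by name."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"datasets folder not found: {folder}")
    return sorted(folder.glob("*.zip"))


def run_benchmark(folder):
    """Run the benchmark on a local folder of custom dataset zip files."""
    if not list_dataset_zips(folder):
        raise ValueError(f"no zip files found in {folder}")

    import sdgym  # imported lazily so the checks above work without SDGym

    # Assumption: benchmark_single_table is the entry point in use here.
    # The folder's path is passed as a plain string.
    return sdgym.benchmark_single_table(additional_datasets_folder=str(folder))
```

Checking the folder first gives a clearer error for a mistyped path than a failure partway through the benchmark run.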
AWS S3 Filepath
If your datasets folder is an Amazon S3 bucket, provide the name of the bucket prefixed with 's3://' instead. For more information, see the docs for AWS Integration.
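The prefix rule can be captured in a small helper. This is only an illustrative sketch: to_s3_path is a hypothetical name, and it does not check that the bucket exists or that credentials are configured.

```python
def to_s3_path(bucket_name):
    """Prefix a bucket name with 's3://' unless it already has that prefix."""
    bucket_name = bucket_name.strip()
    if not bucket_name:
        raise ValueError("bucket name must not be empty")
    if bucket_name.startswith("s3://"):
        return bucket_name
    return f"s3://{bucket_name}"
```

The returned string would then be supplied as the additional_datasets_folder value, for example to_s3_path('my-datasets-bucket'), where the bucket name is a placeholder.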
Results
A result for each dataset and synthesizer will be available after the benchmarking finishes. You can identify each dataset by the name of its zip file.