Dataset Format
This guide describes the format that the SDV ecosystem requires of any datasets. You can supply your own datasets to SDGym as long as they conform to the format. (The public SDV datasets are already present in this format.)
Data Format Requirements
SDGym reads from datasets stored in an Amazon S3 bucket.
Folder Format
Your datasets should be broken up by modality — single_table, multi_table or sequential. Within each modality, there should be a folder per dataset. The folder's name is considered to be the name of the dataset.
This is shown in the structure below.
root_folder/ (local or S3 bucket)
|--- single_table/
|    |--- dataset1/
|    |    |--- data.zip
|    |    |--- metadata.json
|    |    |--- metainfo.yaml
|    |--- dataset2/
|    |    |--- data.zip
|    |    |--- metadata.json
|    |    |--- metainfo.yaml
|    |--- ...
|--- multi_table/
|    |--- dataset3/
|    |    |--- data.zip
|    |    |--- metadata.json
|    |    |--- metainfo.yaml
|    |--- dataset4/
|    |    |--- data.zip
|    |    |--- metadata.json
|    |    |--- metainfo.yaml
|    |--- ...
|--- sequential/
|    |--- ...

Dataset Folder
Within each dataset folder, 3 files are required.

data.zip (required): This compressed zip file contains the actual data, stored as 1 or more CSV files. Each CSV file corresponds to a table of data and should be named in the format <table_name>.csv; one way to assemble this file is shown in the sketch below.

metadata.json (required): The SDV metadata that describes the tables and columns in the data, stored as a JSON file. For more information, see the SDV docs.

metainfo.yaml (required): This YAML file contains additional information about the dataset as a whole, which is useful for keeping track of the dataset's characteristics. See the section below for additional details about the YAML file.

Additional files are optional. The SDV ecosystem has specific guidance for optional README and SOURCE files, which you can find below. Note that these files will not impact SDGym's benchmarking performance.
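For example, if your dataset is a single table stored as a CSV file, you can assemble data.zip and metadata.json with a short script. Below is a minimal sketch, assuming SDV 1.x's SingleTableMetadata API; the guests.csv table is a hypothetical placeholder, so adapt the file names to your own data.

import zipfile

import pandas as pd
from sdv.metadata import SingleTableMetadata

# Hypothetical table file: any CSV that belongs to the dataset
data = pd.read_csv('guests.csv')

# Compress every table into data.zip, one <table_name>.csv per table
with zipfile.ZipFile('data.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.write('guests.csv')

# Auto-detect SDV metadata from the table and save it as metadata.json
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=data)
metadata.save_to_json(filepath='metadata.json')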
Metainfo YAML file
The metainfo YAML file contains labels and tags that help SDV categorize the dataset and understand its provenance. The required fields are described below.
dataset-name (required): The name of the dataset
dataset-bucket (required): The name of the bucket it's stored in
modality (required): The type of data. Options are: single-table, multi-table or sequential
num-tables (required): The number of tables
dataset-size-mb (required): The total size of all the data's tables, in MB
An example of this is shown below.
# example metainfo: adult.yaml
dataset-name: adult
dataset-bucket: sdv-datasets-public
modality: single-table
num-tables: 1
dataset-size-mb: 23.32

Note that the SDV public datasets may include additional YAML fields such as source-url that are used for tracking. They do not impact the performance of the SDGym benchmark.
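You can also generate this file programmatically. Below is a minimal sketch, assuming PyYAML is installed; the dataset name, bucket name and table files are hypothetical placeholders.

import os

import yaml

# Hypothetical inputs: the dataset's table files on disk
table_files = ['guests.csv']
total_bytes = sum(os.path.getsize(path) for path in table_files)

metainfo = {
    'dataset-name': 'my_dataset',
    'dataset-bucket': 'my_bucket',
    'modality': 'single-table',
    'num-tables': len(table_files),
    'dataset-size-mb': round(total_bytes / 1024 ** 2, 2),
}

# Write the required fields to metainfo.yaml
with open('metainfo.yaml', 'w') as f:
    yaml.safe_dump(metainfo, f, sort_keys=False)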
Using your custom datasets in SDGym
To use your custom datasets within SDGym, supply the S3 bucket URL, which should start with s3://. If your bucket is public, SDGym can read from it as long as you provide the URL. If it is private, you will also have to provide credentials (an AWS access key ID and secret access key) with read access.
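If you need to upload a dataset into the expected folder structure first, one option is the boto3 library. Below is a minimal sketch; the bucket name, dataset name and credentials are hypothetical, and the key prefix follows the single_table/<dataset_name>/ layout described above.

import boto3

# Credentials with write access to the bucket (hypothetical values)
s3 = boto3.client(
    's3',
    aws_access_key_id='my_access_key',
    aws_secret_access_key='my_secret',
)

# Upload the 3 required files under the modality/dataset prefix
for filename in ['data.zip', 'metadata.json', 'metainfo.yaml']:
    s3.upload_file(
        Filename=filename,
        Bucket='my_bucket',
        Key=f'single_table/my_dataset/{filename}',
    )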
Exploring Datasets
Create a DatasetExplorer that points to your S3 bucket. You can also pass in your credentials if your bucket is private. For more information, see the Explore Datasets guide.
from sdgym import DatasetExplorer
explorer = DatasetExplorer(
    s3_url='s3://my_bucket/',
    aws_access_key_id='my_access_key',
    aws_secret_access_key='my_secret'
)
summary = explorer.summarize_datasets(modality='single_table')
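You can then inspect or save the summary like any other tabular object. A minimal follow-up, assuming summarize_datasets returns a pandas DataFrame:

# Assuming the summary is a pandas DataFrame
print(summary)
summary.to_csv('dataset_summary.csv', index=False)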
Benchmarking
When benchmarking synthesizers on AWS, you can point to your bucket using the additional_datasets_folder parameter. You can also pass in your credentials if your bucket is private. We also recommend pointing to a different S3 bucket for writing the results.
For more information see the guide for Running a Benchmark (AWS).
import sdgym
results = sdgym.benchmark_single_table_aws(
    additional_datasets_folder='s3://my_bucket/',
    aws_access_key_id='my_access_key',
    aws_secret_access_key='my_secret',
    output_destination='s3://my_results_bucket/'
)