Dataset Format

This guide describes the dataset format that the SDV ecosystem requires. You can supply your own datasets to SDGym as long as they conform to this format. (The public SDV datasets are already stored in this format.)

Data Format Requirements

SDGym reads datasets stored in a local folder or in an Amazon S3 bucket.

Folder Format

Your datasets should be organized by modality: single_table, multi_table, or sequential. Within each modality, there should be one folder per dataset. The folder's name is considered to be the name of the dataset.

This is shown in the structure below.

root_folder/ (local or S3 bucket)
|--- single_table/
     |--- dataset1/
          |--- data.zip
          |--- metadata.json
          |--- metainfo.yaml
     |--- dataset2/
          |--- data.zip
          |--- metadata.json
          |--- metainfo.yaml
     |--- ...
|--- multi_table/
     |--- dataset3/
          |--- data.zip
          |--- metadata.json
          |--- metainfo.yaml
     |--- dataset4/
          |--- data.zip
          |--- metadata.json
          |--- metainfo.yaml
     |--- ...
|--- sequential/
     |--- ...

Dataset Folder

Within each dataset folder, 3 files are required.

  • (required) data.zip: This compressed zip file contains the actual data, stored as one or more CSV files. Each CSV file corresponds to a single table of data and should be named in the format <table_name>.csv (see the packaging sketch below).

  • (required) metadata.json: The SDV metadata that describes the tables and columns in the data. This should be stored as a JSON file. For more information, see the SDV docs.

  • (required) metainfo.yaml: This YAML file contains additional information about the dataset as a whole, which is useful for keeping track of the dataset's characteristics. See the section below for additional details about the YAML file.

Additional files are optional. The SDV ecosystem has specific guidance for optional README and SOURCE files, which you can find below. Note that these files will not impact SDGym's benchmarking performance.
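As a concrete illustration, below is a minimal sketch of packaging a single-table dataset into this layout. The dataset name (hotels), table name (guests), and column set are hypothetical, and the metadata dictionary assumes the SDV single-table metadata spec; adapt both to your own data.

import json
import zipfile
from pathlib import Path

# Hypothetical dataset 'hotels' with a single table 'guests'
dataset_dir = Path('single_table/hotels')
dataset_dir.mkdir(parents=True, exist_ok=True)

# data.zip: one CSV per table, named <table_name>.csv
# (assumes guests.csv already exists on disk)
with zipfile.ZipFile(dataset_dir / 'data.zip', 'w') as zf:
    zf.write('guests.csv', arcname='guests.csv')

# metadata.json: SDV metadata describing the tables and columns
metadata = {
    'METADATA_SPEC_VERSION': 'SINGLE_TABLE_V1',
    'columns': {
        'guest_id': {'sdtype': 'id'},
        'checkin_date': {'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'},
        'amount_paid': {'sdtype': 'numerical'},
    },
}
(dataset_dir / 'metadata.json').write_text(json.dumps(metadata, indent=2))

The metainfo.yaml file is covered in the Metainfo YAML section below.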

README.txt

The README is a freeform text file with more information about what the data means. It can describe what certain variables mean, rules about the dataset, constraints, etc. Examples of what to include in a README:

  • If the original dataset author provided an explanation of what the columns mean, how to interpret the category labels, etc., copy that information

  • If you've discovered interesting constraints or business rules about the dataset, add those as well

  • If there's any other information about the recommended usage of the dataset, add it to the README too

Below is an example README for a dataset.

README.txt for adult dataset

Description: This dataset contains US census records from 1994. Extraction was done by 
Barry Becker from the 1994 Census database.  A set of reasonably clean records was 
extracted using the following conditions: 
((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

Usage: This dataset was primarily meant to be used for building a predictive ML model.
The prediction task is to determine whether a person's income is over $50,000 a year.

SOURCE.txt

A freeform text file with more information about the provenance of the dataset, such as details about its original authors.

Below is an example of what the SOURCE.txt file can look like for a dataset. It can contain the license name, original source URLs for the dataset, any provided citations, etc.

License name: Creative Commons Attribution 4.0 International (CC BY 4.0)
Source URL: https://archive.ics.uci.edu/dataset/2/adult

Citation:
[1] Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. 
[2] https://doi.org/10.24432/C5XW20

Metainfo YAML file

The metainfo YAML file contains labels and tags that help SDV categorize the dataset and understand its provenance. The required fields are described below:

  • (required) dataset-name: The name of the dataset

  • (required) dataset-bucket: The name of the bucket the dataset is stored in

  • (required) modality: The type of data. Options are: single-table, multi-table, or sequential

  • (required) num-tables: The number of tables in the dataset

  • (required) dataset-size-mb: The total size of all the data tables added up, in MB

An example of this is shown below.

# example metainfo: adult.yaml
dataset-name: adult
dataset-bucket: sdv-datasets-public
modality: single-table
num-tables: 1
dataset-size-mb: 23.32

Note that the SDV public datasets may include additional YAML flags, such as source-url, that are used for tracking. These do not impact the performance of the SDGym benchmark.
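If you generate the metainfo file programmatically, a sketch like the one below works. It assumes PyYAML is installed, reuses the hypothetical hotels/guests dataset from earlier, and interprets dataset-size-mb as the total size of the uncompressed CSV tables.

import os
import yaml  # requires PyYAML

# Hypothetical: the CSV tables that make up the dataset
tables = ['guests.csv']
size_mb = round(sum(os.path.getsize(t) for t in tables) / (1024 * 1024), 2)

metainfo = {
    'dataset-name': 'hotels',
    'dataset-bucket': 'my_bucket',
    'modality': 'single-table',
    'num-tables': len(tables),
    'dataset-size-mb': size_mb,
}
with open('single_table/hotels/metainfo.yaml', 'w') as f:
    yaml.safe_dump(metainfo, f, sort_keys=False)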

Using your custom datasets in SDGym

To use your custom datasets within SDGym, supply the S3 bucket URL. The URL should start with s3://. If your bucket is public, SDGym can read from it as long as you provide the URL. If it is private, you will also have to provide credentials (an AWS access key ID and secret access key) with read access.

Exploring Datasets

Create a DatasetExplorer that points to your S3 bucket. You can also pass in your credentials if your bucket is private. For more information, see the Explore Datasets guide.

from sdgym import DatasetExplorer

explorer = DatasetExplorer(
    s3_url='s3://my_bucket/',
    aws_access_key_id='my_access_key',
    aws_secret_access_key='my_secret'
)
summary = explorer.summarize_datasets(modality='single_table')

Benchmarking

When benchmarking synthesizers on AWS, you can point to your bucket using the additional_datasets_folder parameter. You can also pass in your credentials if your bucket is private. We also recommend pointing to a different S3 bucket for writing the results.

For more information see the guide for Running a Benchmark (AWS).

import sdgym

results = sdgym.benchmark_single_table_aws(
    additional_datasets_folder='s3://my_bucket/',
    aws_access_key_id='my_access_key',
    aws_secret_access_key='my_secret',
    output_destination='s3://my_results_bucket/'
)
