Custom Datasets

This guide provides instructions for including your own custom datasets in the SDGym benchmarking framework.

Custom Dataset Requirements

You can add any number of custom datasets that represent single table data.

Dataset Format

Each dataset must have:

  • Data, stored as a single CSV file with a name ending in .csv

  • Metadata, stored as a JSON file named metadata.json (one way to create this file is sketched below)
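
If you don't already have a metadata.json file, one way to create it is with the SDV library's SingleTableMetadata class. This is a minimal sketch; the file names are placeholders, and you should review the detected column types before saving.

from sdv.metadata import SingleTableMetadata

# detect metadata from the dataset's CSV file (placeholder file name)
metadata = SingleTableMetadata()
metadata.detect_from_csv(filepath='my_users.csv')

# review the detected column types, then save alongside the CSV
metadata.save_to_json(filepath='metadata.json')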

File Structure

SDGym is optimized for applying multiple custom datasets to the benchmarking framework. Please convert each custom dataset into the following file structure:

  1. Compress the CSV and JSON file for each dataset into a single zip file

  2. Put all the zip files into a single folder for all your custom datasets

The overall structure is illustrated below; a packaging sketch follows the illustration.

data/my_custom_datasets/
|
|-- my_users.zip
|   |-- my_users.csv
|   |-- metadata.json
|
|-- my_financials.zip
|   |-- my_financials.csv
|   |-- metadata.json
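
One way to produce this structure is with Python's built-in zipfile module. The folder and file names below are placeholders taken from the illustration above; adjust them to your own datasets.

import zipfile
from pathlib import Path

# placeholder source folders, each containing <name>.csv and metadata.json
source_folders = [Path('my_users'), Path('my_financials')]
output_folder = Path('data/my_custom_datasets')
output_folder.mkdir(parents=True, exist_ok=True)

for folder in source_folders:
    with zipfile.ZipFile(output_folder / f'{folder.name}.zip', 'w') as archive:
        # store the CSV and the metadata at the top level of the zip file
        archive.write(folder / f'{folder.name}.csv', arcname=f'{folder.name}.csv')
        archive.write(folder / 'metadata.json', arcname='metadata.json')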

Using Your Custom Datasets

To use your custom datasets, supply the path to the overall folder using the additional_datasets_folder parameter.

Your folder can be stored locally on your computer, or it can be an Amazon S3 bucket.

Local Path

If you have the datasets folder stored on your machine, provide the folder's path as a string.

import sdgym

sdgym.benchmark_single_table(
    additional_datasets_folder='/data/my_custom_datasets/'
)

Amazon S3 Integration

If your datasets folder is an Amazon S3 bucket, you can instead provide the name of the bucket, prefixed with 's3://'.

import sdgym

sdgym.benchmark_single_table(
    additional_datasets_folder='s3://my-demo-bucket'
)

Is your S3 bucket private? Authenticate into your Amazon S3 account first using environment variables.

import os

import sdgym

# set the AWS environment variables to authenticate into your Amazon S3
os.environ['AWS_ACCESS_KEY_ID'] = 'XXX'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'XXX'
os.environ['AWS_SESSION_TOKEN'] = 'XXX'  # optional

# now you can supply the URL to your private bucket
sdgym.benchmark_single_table(
    additional_datasets_folder='s3://my-private-demo-bucket'
)

Results

A result for each dataset and synthesizer will be available after the benchmarking finishes. You can identify each dataset by the name of its zip file.

Synthesizer                Dataset        Dataset_Size_MB  Model_Time  Peak_Memory_KB  Model_Size_MB  Sample_Time  Evaluate_Time  Quality_Score  NewRowSynthesis
FASTMLPreset               my_users       34.5             45.45       100201          0.340          2012.2       1001.2         0.71882        0.99901
FASTMLPreset               my_financials  130.2            200.691     100231          0.450          2012.2       1012.2         0.88191        1.0
GaussianCopulaSynthesizer  my_users       34.5             123.56      300101          0.981          2012.1       1001.2         0.9991991      0.998191
GaussianCopulaSynthesizer  my_financials  130.2            23546.12    201011          1.232          2012.2       101012.1       0.689101       1.0
...
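
The results are returned by benchmark_single_table, so you can filter and save them like any other table. A minimal sketch, assuming the return value is a pandas DataFrame with the columns shown above and the placeholder dataset names from earlier:

import sdgym

results = sdgym.benchmark_single_table(
    additional_datasets_folder='/data/my_custom_datasets/'
)

# keep only the rows for your custom datasets and save them for later review
my_results = results[results['Dataset'].isin(['my_users', 'my_financials'])]
my_results.to_csv('my_benchmark_results.csv', index=False)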
