Explore the Artifacts

In addition to the Results Summary, SDGym saves the intermediate artifacts it creates along the way, including the synthesizer objects themselves and the synthetic data they created for each dataset/synthesizer pair.

Folder Structure

When running the SDGym benchmark locally or on AWS, you can include an output_destination folder to save the artifacts. The benchmark adds a sub-folder corresponding to the day you ran the benchmark, named SDGym_results_<MM>_<DD>_<YYYY>. Inside the sub-folder, you'll see one folder per dataset, named <dataset_name>_<MM>_<DD>_<YYYY>, and inside each dataset folder, one folder per synthesizer. Each synthesizer folder contains all the artifacts for that synthesizer/dataset combination.

  • Synthesizers are stored as <synthesizer_name>.pkl

  • The synthetic data is stored as <synthesizer_name>_synthetic_data.csv

  • Individual results for a synthesizer/dataset pair are stored in a single row in the <synthesizer_name>_benchmark_result.csv file.

An example of this folder structure is shown below for an SDGym benchmark run on June 24, 2025:

output_destination/
|--- SDGym_results_06_24_2025/
     |--- census_06_24_2025/
          |--- CTGANSynthesizer/  
               |--- CTGANSynthesizer.pkl
               |--- CTGANSynthesizer_synthetic_data.csv
               |--- CTGANSynthesizer_benchmark_result.csv
          |--- TVAESynthesizer/  
               |--- TVAESynthesizer.pkl
               |--- TVAESynthesizer_synthetic_data.csv
               |--- TVAESynthesizer_benchmark_result.csv
          |--- GaussianCopulaSynthesizer/  
               |--- ...
     |--- expedia_hotel_logs_06_24_2025/
          |--- CTGANSynthesizer/  
               |--- ...
          |--- TVAESynthesizer/  
               |--- ...
     |--- meta.yaml
     |--- results.csv

You'll also notice two additional files in the folder:

  • The results.csv file contains the Results Summary. This is the consolidation of all the individual <synthesizer_name>_benchmark_result.csv files.

  • The meta.yaml file contains information about the benchmarking run, including the version of SDGym, version of SDV, date the run was finished, and any other relevant information. This file is useful for debugging purposes.
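The naming convention above can be turned into file paths with a few lines of standard-library Python. This is a minimal sketch, not part of SDGym: the helper name artifact_paths is hypothetical, and it assumes the folder and file names follow the example tree exactly.

```python
from datetime import date
from pathlib import Path


def artifact_paths(output_destination, run_date, dataset_name, synthesizer_name):
    """Build the expected artifact paths for one dataset/synthesizer pair.

    Hypothetical helper (not an SDGym function). It only mirrors the
    naming convention shown in the folder structure above.
    """
    stamp = run_date.strftime('%m_%d_%Y')  # e.g. 06_24_2025
    folder = (
        Path(output_destination)
        / f'SDGym_results_{stamp}'
        / f'{dataset_name}_{stamp}'
        / synthesizer_name
    )
    return {
        'synthesizer': folder / f'{synthesizer_name}.pkl',
        'synthetic_data': folder / f'{synthesizer_name}_synthetic_data.csv',
        'benchmark_result': folder / f'{synthesizer_name}_benchmark_result.csv',
    }


paths = artifact_paths(
    'output_destination', date(2025, 6, 24), 'census', 'CTGANSynthesizer'
)
print(paths['synthetic_data'].as_posix())
```

The function only builds paths; it does not check that the files exist, so pair it with Path.exists() if a job may have failed.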

Are you running a benchmark regularly? We recommend writing to the same output_destination folder every time. Each benchmark run will be stored in a different folder, ready for you to explore and compare results. For example:

output_destination/
|--- SDGym_results_06_24_2025/
     |--- <dataset> ...
     |--- meta.yaml
     |--- results.csv
|--- SDGym_results_07_24_2025/
     |--- <dataset> ...
     |--- meta.yaml
     |--- results.csv
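Because the folder names put the month before the year, sorting them alphabetically does not always give chronological order. The sketch below shows one way to order run folders by date; sort_run_folders is a hypothetical helper, and it assumes every name follows the SDGym_results_<MM>_<DD>_<YYYY> pattern.

```python
from datetime import datetime

PREFIX = 'SDGym_results_'


def sort_run_folders(folder_names):
    """Sort SDGym_results_<MM>_<DD>_<YYYY> folder names from oldest to newest."""

    def run_date(name):
        # Strip the prefix and parse the remaining MM_DD_YYYY stamp.
        return datetime.strptime(name[len(PREFIX):], '%m_%d_%Y')

    return sorted(folder_names, key=run_date)


runs = [
    'SDGym_results_07_24_2025',
    'SDGym_results_12_01_2024',
    'SDGym_results_06_24_2025',
]
print(sort_run_folders(runs))
```

Here the December 2024 run is placed first, whereas a plain alphabetical sort would place it last.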

Results Explorer

Use the ResultsExplorer class to programmatically navigate the output destination folder and access the artifacts. To start, create the object by providing the path to the output destination folder.

from sdgym import ResultsExplorer

my_results_explorer = ResultsExplorer(path='my_files/output_destination/')

Once you've created the object, you can use any of the functions below to navigate and access the artifacts.

list

Use this function to list all the SDGym results that exist within the folder. Each result corresponds to a sub-folder named SDGym_results_<MM>_<DD>_<YYYY>.

Parameters: (None)

Output: A list of strings corresponding to each of the SDGym benchmarking results

my_results_explorer.list()

[ 'SDGym_results_06_24_2025', 'SDGym_results_07_24_2025', ... ]

load_metainfo

Use this function to load the meta.yaml file, which contains information about the benchmarking run, including the version of SDGym, version of SDV, date the run was finished, and any other relevant information.

Parameters:

  • (required) results_folder_name: A string with the name of the folder that contains the run. Use the list function to get a list of possible options.

Output: A Python dictionary containing the contents of the metainfo from all the runs that occurred during that day. This information can be useful for debugging.

metainfo = my_results_explorer.load_metainfo(
    results_folder_name='SDGym_results_06_24_2025')

{
  'run_06_24_2025_0': {
    'sdgym_version': '0.10.0',
    'sdv_version': '1.23.0',
    'starting_date': '06_24_2025 18:05:35',
    'completed_date': '06_24_2025 18:06:23',
    'jobs': [
      ('alarm', 'GaussianCopulaSynthesizer'),
      ('census', 'GaussianCopulaSynthesizer'),
      ('census', 'CTGANSynthesizer'),
      ...]}}
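One use of the metainfo is to check how long each run took. The following is a minimal sketch that works on a dictionary shaped like the example output; run_durations is a hypothetical helper, and the date format string is an assumption inferred from the values shown above.

```python
from datetime import datetime

# Assumed from the example output, e.g. '06_24_2025 18:05:35'.
DATE_FORMAT = '%m_%d_%Y %H:%M:%S'


def run_durations(metainfo):
    """Return the duration in seconds of each run in a metainfo dictionary."""
    durations = {}
    for run_name, info in metainfo.items():
        started = datetime.strptime(info['starting_date'], DATE_FORMAT)
        completed = datetime.strptime(info['completed_date'], DATE_FORMAT)
        durations[run_name] = (completed - started).total_seconds()
    return durations


# A trimmed copy of the example metainfo shown above.
example_metainfo = {
    'run_06_24_2025_0': {
        'starting_date': '06_24_2025 18:05:35',
        'completed_date': '06_24_2025 18:06:23',
    }
}
print(run_durations(example_metainfo))  # {'run_06_24_2025_0': 48.0}
```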

load_results

Use this function to load the results summary from a benchmarking run on a given day.

Parameters:

  • (required) results_folder_name: A string with the name of the folder that contains the run. Use the list function to get a list of possible options.

Output: A pandas DataFrame object containing the results from that day. For more information about the results, see the Results Summary guide.

results_summary = my_results_explorer.load_results(
    results_folder_name='SDGym_results_06_24_2025')

Synthesizer                Dataset   Dataset_Size_MB   Train_Time   ...
GaussianCopulaSynthesizer  alarm     34.5              123.56       ...
GaussianCopulaSynthesizer  census    130.2             2356.12      ...
CTGANSynthesizer           alarm     34.5              NaN          ...
CTGANSynthesizer           census    130.2             3140.4       ...
UniformSynthesizer         alarm     34.5              1.1          ...
UniformSynthesizer         census    130.2             15.5         ...

In most cases, the results summary is the same as the results.csv file. However, if you've performed multiple runs on the same day, then SDGym will concatenate the results from all runs on that day. For more information, see the FAQ below.

load_synthesizer

Use this function to load a synthesizer object created for a dataset during a particular benchmark.

Parameters:

  • (required) results_folder_name: A string with the name of the sub-folder that corresponds to a benchmarking result. Use the list function to get the possible options.

  • (required) dataset_name: A string with the name of the dataset that the synthesizer was trained on.

  • (required) synthesizer_name: A string with the name of the synthesizer that was used. Use the results summary to get a list of all possible dataset and synthesizer options.

Output: The synthesizer object that was created during the provided SDGym benchmarking run. The synthesizer has already been fitted on the given dataset.

my_synthesizer = my_results_explorer.load_synthesizer(
    results_folder_name='SDGym_results_06_24_2025',
    dataset_name='alarm',
    synthesizer_name='GaussianCopulaSynthesizer'
)

load_synthetic_data

Use this function to load the synthetic data created for a dataset during a particular benchmark.

Parameters:

  • (required) results_folder_name: A string with the name of the sub-folder that corresponds to a benchmarking result. Use the list function to get the possible options.

  • (required) dataset_name: A string with the name of the dataset that the benchmark created synthetic data for.

  • (required) synthesizer_name: A string with the name of the synthesizer that was used. Use the results summary to get a list of all possible dataset and synthesizer options.

Output: A pandas.DataFrame object containing the synthetic data that corresponds to the given dataset for the given SDGym benchmarking run.

my_synthetic_data = my_results_explorer.load_synthetic_data(
    results_folder_name='SDGym_results_06_24_2025',
    dataset_name='alarm',
    synthesizer_name='GaussianCopulaSynthesizer'
)

load_real_data

Use this function to load the original dataset that SDGym uses for benchmarking. The dataset is the same regardless of which SDGym benchmark run you are exploring.

Parameters:

  • (required) dataset_name: A string with the name of the dataset. For more information, see the Datasets reference.

Output: A pandas.DataFrame object containing the original dataset used by SDGym

original_dataset = my_results_explorer.load_real_data(
    dataset_name='alarm'
)
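A common next step is to load the real data and the matching synthetic data together so they can be compared side by side. The sketch below wraps the two calls documented above; load_real_and_synthetic is a hypothetical convenience function, and explorer stands for any object that offers the load_real_data and load_synthetic_data functions described in this guide.

```python
def load_real_and_synthetic(explorer, results_folder_name, dataset_name,
                            synthesizer_name):
    """Load the real data and the matching synthetic data for one pair.

    Hypothetical helper: it only forwards to the two explorer functions
    documented above and returns both results as a tuple.
    """
    real_data = explorer.load_real_data(dataset_name=dataset_name)
    synthetic_data = explorer.load_synthetic_data(
        results_folder_name=results_folder_name,
        dataset_name=dataset_name,
        synthesizer_name=synthesizer_name,
    )
    return real_data, synthetic_data
```

With the explorer created earlier, a call might look like load_real_and_synthetic(my_results_explorer, 'SDGym_results_06_24_2025', 'alarm', 'GaussianCopulaSynthesizer').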

FAQs

If I run SDGym multiple times in a day, which artifacts are stored?

Synthesizer and synthetic data artifacts are always written to the same folder during a given day. If there are multiple runs within a day, SDGym will not overwrite the files. Instead, it will generate new files with suffixes (1), (2), (3) and so on.

The example below shows what happens if you run SDGym twice: first with the CTGANSynthesizer and then with the TVAESynthesizer. The synthesizers and synthetic data are stored within their respective folders, but you'll see multiple meta.yaml and results.csv files, one for each run.

output_destination/
|--- SDGym_results_06_24_2025/
     |--- census_06_24_2025/
          |--- CTGANSynthesizer/  
               |--- ...
          |--- TVAESynthesizer/  
               |--- ...
     |--- expedia_hotel_logs_06_24_2025/
          |--- CTGANSynthesizer/  
               |--- ...
          |--- TVAESynthesizer/  
               |--- ...
     |--- meta.yaml
     |--- results.csv
     |--- meta(1).yaml
     |--- results(1).csv
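If you want to read the suffixed files directly, without going through the explorer, a glob pattern picks up results.csv together with results(1).csv, results(2).csv and so on. This is a minimal standard-library sketch; read_all_results is a hypothetical helper, and it assumes all results files in the folder share the same header row.

```python
import csv
from pathlib import Path


def read_all_results(run_folder):
    """Read the rows of results.csv and any results(N).csv in a run folder.

    Returns a list of dictionaries, one per row. Files are visited in
    sorted name order, which is not necessarily the order of the runs.
    """
    rows = []
    for results_file in sorted(Path(run_folder).glob('results*.csv')):
        with open(results_file, newline='') as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

For example, read_all_results('output_destination/SDGym_results_06_24_2025') would gather the rows from both runs in the tree above.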
