Explore the Artifacts
In addition to the Results Summary, SDGym saves the intermediary artifacts it creates along the way. These include the synthesizer objects themselves and the synthetic data they created for each dataset/synthesizer pair.
Folder Structure
When running the SDGym benchmark locally or on AWS, you can include an output_destination folder to save the artifacts. The benchmark splits the results into single-table and multi-table runs. Within each, it adds a sub-folder corresponding to the day you ran the benchmark, named SDGym_results_<MM>_<DD>_<YYYY>. Inside that sub-folder, you'll see one folder per dataset, named <dataset_name>_<MM>_<DD>_<YYYY>, and inside each dataset folder, one folder per synthesizer. Each synthesizer folder holds all the artifacts for that synthesizer/dataset combination.
Synthesizers are stored as <synthesizer_name>.pkl
The synthetic data is stored as <synthesizer_name>_synthetic_data.csv
Individual results for a synthesizer/dataset pair are stored in a single row in the <synthesizer_name>_benchmark_result.csv file.
An example of this folder structure is shown below on an SDGym benchmark run on June 24, 2025:
output_destination/
|--- single-table/
    |--- SDGym_results_06_24_2025/
        |--- census_06_24_2025/
            |--- CTGANSynthesizer/
                |--- CTGANSynthesizer.pkl
                |--- CTGANSynthesizer_synthetic_data.csv
                |--- CTGANSynthesizer_benchmark_result.csv
            |--- GaussianCopulaSynthesizer/
                |--- ...
        |--- expedia_hotel_logs_06_24_2025/
            |--- CTGANSynthesizer/
                |--- ...
            |--- TVAESynthesizer/
                |--- ...
        |--- metainfo.yaml
        |--- results.csv
|--- multi-table/
    |--- SDGym_results_06_24_2025/
        |--- ...

You'll also notice two additional files in the folder:
The results.csv file contains the Results Summary. This is the consolidation of all the individual <synthesizer_name>_benchmark_result.csv files.
The metainfo.yaml file contains information about the benchmarking run, including the version of SDGym, version of SDV, date the run was finished, and any other relevant information. This file is useful for debugging purposes.
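
The naming convention described above can be sketched in a few lines of standard-library Python. The helper name artifact_paths and its arguments are hypothetical and not part of SDGym. The sketch assumes that results.csv and metainfo.yaml sit directly inside the SDGym_results_<MM>_<DD>_<YYYY> folder, and it ignores the numbered suffixes that can appear when the benchmark runs more than once in a day.

```python
from datetime import date
from pathlib import PurePosixPath


def artifact_paths(output_destination, modality_folder, run_date,
                   dataset_name, synthesizer_name):
    """Build the expected artifact paths for one dataset/synthesizer pair.

    Hypothetical helper: it only mirrors the folder layout described in
    this guide and does not touch the filesystem.
    """
    stamp = run_date.strftime('%m_%d_%Y')  # <MM>_<DD>_<YYYY>
    run_folder = (
        PurePosixPath(output_destination)
        / modality_folder
        / f'SDGym_results_{stamp}'
    )
    pair_folder = run_folder / f'{dataset_name}_{stamp}' / synthesizer_name
    return {
        'synthesizer': pair_folder / f'{synthesizer_name}.pkl',
        'synthetic_data': pair_folder / f'{synthesizer_name}_synthetic_data.csv',
        'benchmark_result': pair_folder / f'{synthesizer_name}_benchmark_result.csv',
        # Assumption: the two run-level files live in the run folder itself.
        'results_summary': run_folder / 'results.csv',
        'metainfo': run_folder / 'metainfo.yaml',
    }


paths = artifact_paths(
    'output_destination', 'single-table', date(2025, 6, 24),
    'census', 'CTGANSynthesizer',
)
print(paths['synthetic_data'])
# output_destination/single-table/SDGym_results_06_24_2025/census_06_24_2025/CTGANSynthesizer/CTGANSynthesizer_synthetic_data.csv
```

In practice the ResultsExplorer class described below is likely the more convenient way to reach these files; the sketch is only meant to make the layout concrete.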
Are you running a benchmark regularly? We recommend writing to the same output_destination folder every time. Each benchmark run will be stored in a different folder, ready for you to explore and compare results. For example:
Results Explorer
Use the ResultsExplorer class to programmatically navigate the output destination folder and access the artifacts.
Parameters:
(required) path: The filepath to the top-level folder containing all the results
(required) modality: Choose either 'single_table' or 'multi_table' to view the right results
Once you've created the object, you can use any of the functions below to navigate and access the artifacts.
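
A minimal sketch of creating the object is shown here. The wrapper make_explorer is hypothetical, and the import location (the top-level sdgym package) is an assumption that may differ in your installed version. The check runs before the import so that a folder-style value such as 'single-table' (with a hyphen) is caught early, since the modality values use underscores.

```python
VALID_MODALITIES = ('single_table', 'multi_table')


def make_explorer(path, modality):
    """Create a ResultsExplorer after checking the modality value.

    Hypothetical convenience wrapper, not part of SDGym.
    """
    if modality not in VALID_MODALITIES:
        raise ValueError(
            f'modality must be one of {VALID_MODALITIES}, got {modality!r}'
        )
    # Assumption: ResultsExplorer is importable from the top-level package.
    from sdgym import ResultsExplorer

    return ResultsExplorer(path=path, modality=modality)
```

With SDGym installed, a call such as make_explorer('output_destination/', 'single_table') would return the explorer used in the sections below.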
list
Use this function to list all the SDGym results that exist within the folder. Each result corresponds to a sub-folder named SDGym_results_<MM>_<DD>_<YYYY>.
Parameters: (None)
Output: A list of strings corresponding to each of the SDGym benchmarking results
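
Because the folder names put the month first, sorting the returned strings alphabetically does not order them by date. The sketch below parses the date instead, assuming the SDGym_results_<MM>_<DD>_<YYYY> pattern used in the folder structure example. The helper names run_date and latest_run are hypothetical, and the input list stands in for the output of the list function.

```python
from datetime import datetime

PREFIX = 'SDGym_results_'


def run_date(results_folder_name):
    """Parse the date from a name such as 'SDGym_results_06_24_2025'."""
    stamp = results_folder_name.removeprefix(PREFIX)
    return datetime.strptime(stamp, '%m_%d_%Y').date()


def latest_run(results_folder_names):
    """Return the most recent folder name, or None for an empty list."""
    return max(results_folder_names, key=run_date, default=None)


# Stand-in for the list of strings returned by the explorer.
names = [
    'SDGym_results_12_30_2024',
    'SDGym_results_06_24_2025',
    'SDGym_results_01_02_2025',
]
print(latest_run(names))  # SDGym_results_06_24_2025
```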
load_metainfo
Use this function to load the metainfo.yaml file, which contains information about the benchmarking run, including the version of SDGym, version of SDV, date the run was finished, and any other relevant information.
Parameters:
(required) results_folder_name: A string with the name of the folder that contains the run. Use the list function to get a list of possible options.
Output: A Python dictionary containing the contents of the metainfo from all the runs that occurred during that day. This information can be useful for debugging.
load_results
Use this function to load the results summary from a benchmarking run on a given day.
Parameters:
(required) results_folder_name: A string with the name of the folder that contains the run. Use the list function to get a list of possible options.
Output: A pandas DataFrame object containing the results from that day. For more information about the results, see the Results Summary guide.
In most cases, the results summary is the same as the results.csv file. However, if you've performed multiple runs on the same day, then SDGym will concatenate the results from all runs on that day. For more information, see the FAQ below.
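
The three functions described so far can be combined to get an overview of every run in the folder. The sketch below is one possible way to do that. The function name load_run_overview is hypothetical, and explorer is assumed to be any object that exposes list, load_metainfo and load_results with the parameters described above.

```python
def load_run_overview(explorer):
    """Map each results folder name to its metainfo and results summary.

    Hypothetical helper: `explorer` is expected to provide the list,
    load_metainfo and load_results functions described in this guide.
    """
    overview = {}
    for name in explorer.list():
        overview[name] = {
            'metainfo': explorer.load_metainfo(results_folder_name=name),
            'results': explorer.load_results(results_folder_name=name),
        }
    return overview
```

Each entry then holds the metainfo dictionary and the results DataFrame for one day, which makes it straightforward to compare days side by side.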
load_synthesizer
Use this function to load a synthesizer object created for a dataset during a particular benchmark.
Parameters:
(required) results_folder_name: A string with the name of the sub-folder that corresponds to a benchmarking result. Use the list function to get the possible options.
(required) dataset_name: A string with the name of the dataset that the synthesizer was trained on.
(required) synthesizer_name: A string with the name of the synthesizer that was used. Use the results summary to get a list of all possible dataset and synthesizer options.
Output: The synthesizer object that was created during the provided SDGym benchmarking run. The synthesizer has already been fitted on the given dataset.
load_synthetic_data
Use this function to load the synthetic data created for a dataset during a particular benchmark.
Parameters:
(required) results_folder_name: A string with the name of the sub-folder that corresponds to a benchmarking result. Use the list function to get the possible options.
(required) dataset_name: A string with the name of the dataset that the benchmark created synthetic data for.
(required) synthesizer_name: A string with the name of the synthesizer that was used. Use the results summary to get a list of all possible dataset and synthesizer options.
Output:
For a single-table dataset, this is a pandas.DataFrame object containing the synthetic data that corresponds to the given dataset for the given SDGym benchmarking run.
For a multi-table dataset, this is a dictionary that maps each table name (string) to the pandas.DataFrame object containing the synthetic data.
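
Since the output type depends on the modality, downstream code may want a single shape to work with. The sketch below is one way to get that: it wraps a single-table result in a one-entry dictionary keyed by the dataset name. The function name synthetic_tables and that keying choice are assumptions made for illustration, and explorer is assumed to expose load_synthetic_data as described above.

```python
def synthetic_tables(explorer, results_folder_name, dataset_name,
                     synthesizer_name):
    """Return synthetic data as a {table_name: data} dictionary.

    Hypothetical helper: multi-table results are returned unchanged,
    single-table results are wrapped under the dataset name.
    """
    data = explorer.load_synthetic_data(
        results_folder_name=results_folder_name,
        dataset_name=dataset_name,
        synthesizer_name=synthesizer_name,
    )
    if isinstance(data, dict):
        return data
    return {dataset_name: data}
```

Code that iterates over the returned dictionary then works the same way for both single-table and multi-table runs.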
load_real_data
Use this function to load the original dataset that is used for benchmarking SDGym. The dataset is the same regardless of the SDGym benchmark.
Parameters:
(required) dataset_name: A string with the name of the dataset. For more information, see the Datasets reference.
Output:
For a single-table dataset, this is a pandas.DataFrame object containing the real data.
For a multi-table dataset, this is a dictionary that maps each table name (string) to the pandas.DataFrame object containing the real data.
FAQs
If I run SDGym multiple times in a day, which artifacts are stored?
Synthesizer and synthetic data artifacts are always written to the same folder during a given day. If there are multiple runs within a day, SDGym will not overwrite the files. Instead, it will generate new files with suffixes (1), (2), (3) and so on.
The example below shows what happens if you run SDGym twice: first with the CTGANSynthesizer and then with the TVAESynthesizer. The synthesizers and synthetic data are stored within their respective folders. But you'll see multiple metainfo.yaml and results.csv files, corresponding to each run.