Explore the Artifacts
In addition to the Results Summary, SDGym saves the intermediate artifacts it creates along the way. These include the synthesizer objects themselves and the synthetic data they created for each dataset/synthesizer pair.
Folder Structure
When running the SDGym benchmark locally or on AWS, you can provide an output_destination folder to save the artifacts. The benchmark adds a sub-folder for the day you ran the benchmark, named SDGym_results_<MM>_<DD>_<YYYY>. Inside that sub-folder is one folder per dataset, named <dataset_name>_<MM>_<DD>_<YYYY>, and inside each dataset folder is one folder per synthesizer. Each synthesizer folder contains all the artifacts for that synthesizer/dataset combination.
Synthesizers are stored as <synthesizer_name>.pkl
The synthetic data is stored as <synthesizer_name>_synthetic_data.csv
Individual results for a synthesizer/dataset pair are stored in a single row in the <synthesizer_name>_benchmark_result.csv file.
An example of this folder structure is shown below on an SDGym benchmark run on June 24, 2025:
output_destination/
|--- SDGym_results_06_24_2025/
    |--- census_06_24_2025/
        |--- CTGANSynthesizer/
            |--- CTGANSynthesizer.pkl
            |--- CTGANSynthesizer_synthetic_data.csv
            |--- CTGANSynthesizer_benchmark_result.csv
        |--- TVAESynthesizer/
            |--- TVAESynthesizer.pkl
            |--- TVAESynthesizer_synthetic_data.csv
            |--- TVAESynthesizer_benchmark_result.csv
        |--- GaussianCopulaSynthesizer/
            |--- ...
    |--- expedia_hotel_logs_06_24_2025/
        |--- CTGANSynthesizer/
            |--- ...
        |--- TVAESynthesizer/
            |--- ...
    |--- meta.yaml
    |--- results.csv
You'll also notice two additional files in the folder:
The results.csv file contains the Results Summary. This is the consolidation of all the individual <synthesizer_name>_benchmark_result.csv files.
The meta.yaml file contains information about the benchmarking run, including the version of SDGym, version of SDV, date the run was finished, and any other relevant information. This file is useful for debugging purposes.
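Because the layout is predictable, it can also be walked with ordinary file-system tools. The following is a minimal sketch, not part of SDGym: it builds a throwaway copy of the documented layout in a temporary folder (with empty placeholder files) and lists every dataset/synthesizer pair it finds. The helper name list_artifact_pairs is hypothetical, and the sketch assumes that meta.yaml and results.csv sit directly inside the SDGym_results_<MM>_<DD>_<YYYY> folder.

```python
import tempfile
from pathlib import Path


def list_artifact_pairs(run_folder):
    """Return sorted (dataset, synthesizer) pairs found under one run folder.

    Hypothetical helper: relies only on the folder layout described above.
    """
    run_folder = Path(run_folder)
    # The run folder is named SDGym_results_<MM>_<DD>_<YYYY>; dataset folders
    # reuse the same date suffix, so strip it to recover the dataset name.
    date_suffix = run_folder.name[len('SDGym_results'):]  # e.g. '_06_24_2025'
    pairs = []
    for dataset_dir in run_folder.iterdir():
        if not dataset_dir.is_dir():
            continue  # skips plain files such as meta.yaml and results.csv
        dataset = dataset_dir.name
        if dataset.endswith(date_suffix):
            dataset = dataset[:-len(date_suffix)]
        for synth_dir in dataset_dir.iterdir():
            if synth_dir.is_dir():
                pairs.append((dataset, synth_dir.name))
    return sorted(pairs)


# Build a small stand-in for the documented layout (empty placeholder files).
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / 'output_destination' / 'SDGym_results_06_24_2025'
    for dataset, synth in [('census', 'CTGANSynthesizer'),
                           ('census', 'TVAESynthesizer'),
                           ('expedia_hotel_logs', 'CTGANSynthesizer')]:
        folder = run / f'{dataset}_06_24_2025' / synth
        folder.mkdir(parents=True)
        for suffix in ('.pkl', '_synthetic_data.csv', '_benchmark_result.csv'):
            (folder / f'{synth}{suffix}').touch()
    (run / 'meta.yaml').touch()
    (run / 'results.csv').touch()

    pairs = list_artifact_pairs(run)
    print(pairs)
```

For everyday use, the Results Explorer described in the next section is the intended way to reach these files; a manual walk like this is mainly useful for quick sanity checks of what a run produced.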
Results Explorer
Use the ResultsExplorer class to programmatically navigate the output destination folder and access the artifacts. To start, create the object by providing the path to the output destination folder.
from sdgym import ResultsExplorer
my_results_explorer = ResultsExplorer(path='my_files/output_destination/')
Once you've created the object, you can use any of the functions below to navigate and access the artifacts.
list
Use this function to list all the SDGym results that exist within the folder. Each result corresponds to a sub-folder named SDGym_results_<MM>_<DD>_<YYYY>.
Parameters: (None)
Output: A list of strings corresponding to each of the SDGym benchmarking results
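Since the month comes first in each folder name, sorting the returned strings alphabetically does not order them by date across years. A minimal sketch of sorting them chronologically is shown here, assuming the SDGym_results_<MM>_<DD>_<YYYY> pattern described under Folder Structure. The folder_names list is a hypothetical stand-in for the function's output, and folder_date is an illustrative helper, not part of SDGym.

```python
from datetime import datetime

PREFIX = 'SDGym_results_'


def folder_date(folder_name):
    """Parse the date out of a name like 'SDGym_results_06_24_2025'."""
    return datetime.strptime(folder_name[len(PREFIX):], '%m_%d_%Y').date()


# Stand-in for the output of the list function (hypothetical values).
folder_names = [
    'SDGym_results_06_24_2025',
    'SDGym_results_01_15_2026',
    'SDGym_results_07_24_2025',
]

# Plain string sorting would put 01_15_2026 first because the month leads,
# so sort on the parsed date instead.
chronological = sorted(folder_names, key=folder_date)
latest = chronological[-1]
print(latest)
```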
my_results_explorer.list()
[ 'SDGym_results_06_24_2025', 'SDGym_results_07_24_2025', ... ]
load_metainfo
Use this function to load the meta.yaml file, which contains information about the benchmarking run, including the version of SDGym, version of SDV, date the run was finished, and any other relevant information.
Parameters:
(required) results_folder_name: A string with the name of the folder that contains the run. Use the list function to get a list of possible options.
Output: A Python dictionary containing the contents of the metainfo from all the runs that occurred during that day. This information can be useful for debugging.
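One way to use this dictionary when debugging is to check how long each run took. The sketch below assumes the dictionary has the shape shown in the example for this function, and that the timestamps follow the MM_DD_YYYY HH:MM:SS format seen in those example values; the format string is an inference from the example, not a documented guarantee. The metainfo variable here is a hand-written stand-in so that the snippet runs without SDGym.

```python
from datetime import datetime

# Assumed timestamp format, inferred from the example values in this section.
TIMESTAMP_FORMAT = '%m_%d_%Y %H:%M:%S'

# Stand-in for the dictionary returned by load_metainfo. Only the three jobs
# listed in the example are included; the original list is truncated.
metainfo = {
    'run_06_24_2025_0': {
        'sdgym_version': '0.10.0',
        'sdv_version': '1.23.0',
        'starting_date': '06_24_2025 18:05:35',
        'completed_date': '06_24_2025 18:06:23',
        'jobs': [
            ('alarm', 'GaussianCopulaSynthesizer'),
            ('census', 'GaussianCopulaSynthesizer'),
            ('census', 'CTGANSynthesizer'),
        ],
    }
}

durations = {}  # run name -> elapsed seconds
for run_name, run_info in metainfo.items():
    start = datetime.strptime(run_info['starting_date'], TIMESTAMP_FORMAT)
    end = datetime.strptime(run_info['completed_date'], TIMESTAMP_FORMAT)
    durations[run_name] = (end - start).total_seconds()
    print(run_name, durations[run_name], 'seconds,', len(run_info['jobs']), 'jobs')
```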
metainfo = my_results_explorer.load_metainfo(
    results_folder_name='SDGym_results_06_24_2025')
{
    'run_06_24_2025_0': {
        'sdgym_version': '0.10.0',
        'sdv_version': '1.23.0',
        'starting_date': '06_24_2025 18:05:35',
        'completed_date': '06_24_2025 18:06:23',
        'jobs': [
            ('alarm', 'GaussianCopulaSynthesizer'),
            ('census', 'GaussianCopulaSynthesizer'),
            ('census', 'CTGANSynthesizer'),
            ...]}}
load_results
Use this function to load the results summary from a benchmarking run on a given day.
Parameters:
(required) results_folder_name: A string with the name of the folder that contains the run. Use the list function to get a list of possible options.
Output: A pandas DataFrame object containing the results from that day. For more information about the results, see the Results Summary guide.
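A typical follow-up is to compare synthesizers per dataset. The sketch below picks the synthesizer with the lowest Train_Time for each dataset, using the first four columns from the example for this function. It is written with the standard library only so that it runs without SDGym or pandas; the summary_csv text is a hand-written stand-in for the real results, and treating NaN as "no recorded train time" is an assumption made for illustration.

```python
import csv
import io
import math

# Stand-in for the first columns of a results summary (values taken from the
# example in this section; the remaining columns are omitted).
summary_csv = """Synthesizer,Dataset,Dataset_Size_MB,Train_Time
GaussianCopulaSynthesizer,alarm,34.5,123.56
GaussianCopulaSynthesizer,census,130.2,2356.12
CTGANSynthesizer,alarm,34.5,NaN
CTGANSynthesizer,census,130.2,3140.4
UniformSynthesizer,alarm,34.5,1.1
UniformSynthesizer,census,130.2,15.5
"""

fastest = {}  # dataset name -> (synthesizer name, train time)
for row in csv.DictReader(io.StringIO(summary_csv)):
    train_time = float(row['Train_Time'])  # the text 'NaN' parses to nan
    if math.isnan(train_time):
        continue  # assumed to mean no train time was recorded for this pair
    best = fastest.get(row['Dataset'])
    if best is None or train_time < best[1]:
        fastest[row['Dataset']] = (row['Synthesizer'], train_time)

print(fastest)
```

With the actual DataFrame returned by load_results, the same comparison could be expressed with pandas grouping operations instead of a manual loop.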
results_summary = my_results_explorer.load_results(
    results_folder_name='SDGym_results_06_24_2025')
Synthesizer                Dataset  Dataset_Size_MB  Train_Time  ...
GaussianCopulaSynthesizer  alarm    34.5             123.56      ...
GaussianCopulaSynthesizer  census   130.2            2356.12     ...
CTGANSynthesizer           alarm    34.5             NaN         ...
CTGANSynthesizer           census   130.2            3140.4      ...
UniformSynthesizer         alarm    34.5             1.1         ...
UniformSynthesizer         census   130.2            15.5        ...
load_synthesizer
Use this function to load a synthesizer object created for a dataset during a particular benchmark.
Parameters:
(required) results_folder_name: A string with the name of the sub-folder that corresponds to a benchmarking result. Use the list function to get the possible options.
(required) dataset_name: A string with the name of the dataset that the synthesizer was trained on.
(required) synthesizer_name: A string with the name of the synthesizer that was used. Use the results summary to get a list of all possible dataset and synthesizer options.
Output: The synthesizer object that was created during the provided SDGym benchmarking run. The synthesizer has already been fitted on the given dataset.
my_synthesizer = my_results_explorer.load_synthesizer(
    results_folder_name='SDGym_results_06_24_2025',
    dataset_name='alarm',
    synthesizer_name='GaussianCopulaSynthesizer'
)
load_synthetic_data
Use this function to load the synthetic data created for a dataset during a particular benchmark.
Parameters:
(required) results_folder_name: A string with the name of the sub-folder that corresponds to a benchmarking result. Use the list function to get the possible options.
(required) dataset_name: A string with the name of the dataset that the benchmark created synthetic data for.
(required) synthesizer_name: A string with the name of the synthesizer that was used. Use the results summary to get a list of all possible dataset and synthesizer options.
Output: A pandas.DataFrame object containing the synthetic data that corresponds to the given dataset for the given SDGym benchmarking run.
my_synthetic_data = my_results_explorer.load_synthetic_data(
    results_folder_name='SDGym_results_06_24_2025',
    dataset_name='alarm',
    synthesizer_name='GaussianCopulaSynthesizer'
)
load_real_data
Use this function to load the original dataset that SDGym uses for benchmarking. The dataset is the same regardless of which SDGym benchmark run you are exploring.
Parameters:
(required) dataset_name: A string with the name of the dataset. For more information, see the Datasets reference.
Output: A pandas.DataFrame object containing the original dataset used by SDGym
original_dataset = my_results_explorer.load_real_data(
    dataset_name='alarm'
)