Explore Datasets

Understand your data. Understanding the characteristics of the datasets is important for benchmarking. Use SDGym to summarize the datasets by key characteristics such as the size of the data, the number of tables, the number of each kind of column, and more.

Dataset Explorer

Create a DatasetExplorer object to begin exploring your data. This object is set up to read from an Amazon S3 bucket that contains data.

Parameters:

  • s3_url: A string with the pathway to an S3 bucket containing the data. Provide this only if you have your own datasets in an S3 bucket. Make sure the URL starts with s3://.

    • By default, this is set to the public, demo dataset bucket published by DataCebo.

  • aws_access_key_id: A string containing the AWS access key id. Provide this if your s3_url is for a private S3 bucket.

  • aws_secret_access_key: A string containing the AWS secret access key. Provide this if your s3_url is for a private S3 bucket.

from sdgym import DatasetExplorer

explorer = DatasetExplorer()
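For a private bucket, the three parameters above are passed together. The following is a minimal sketch, not part of SDGym: the helper names build_explorer_kwargs and make_explorer are hypothetical, and reading the credentials from the standard AWS environment variables is an assumption about your setup. Only the DatasetExplorer parameter names come from the list above.

```python
import os


def build_explorer_kwargs(s3_url=None, env=os.environ):
    """Collect keyword arguments for DatasetExplorer (hypothetical helper)."""
    kwargs = {}
    if s3_url is None:
        # No URL given: DatasetExplorer falls back to the public demo bucket.
        return kwargs
    if not s3_url.startswith('s3://'):
        raise ValueError(f"s3_url must start with 's3://', got {s3_url!r}")
    kwargs['s3_url'] = s3_url

    # Assumption: credentials live in the standard AWS environment variables.
    key_id = env.get('AWS_ACCESS_KEY_ID')
    secret = env.get('AWS_SECRET_ACCESS_KEY')
    if key_id and secret:
        kwargs['aws_access_key_id'] = key_id
        kwargs['aws_secret_access_key'] = secret
    return kwargs


def make_explorer(s3_url=None):
    """Create a DatasetExplorer from the collected arguments."""
    # Imported lazily so the helper above can be used without SDGym installed.
    from sdgym import DatasetExplorer
    return DatasetExplorer(**build_explorer_kwargs(s3_url))
```

Calling make_explorer() with no arguments is equivalent to DatasetExplorer(), while make_explorer('s3://my-bucket/') (a placeholder URL) would also pass along any credentials found in the environment.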

Dataset Summary

A dataset summary contains characteristic descriptions for each dataset. We have chosen these characteristics as potentially important attributes that can affect benchmarking.

summarize_datasets

Use this DatasetExplorer function to generate a summary of characteristics for each dataset.

Parameters:

  • (required) modality: The modality of the datasets to summarize. This should be one of 'single_table', 'multi_table' or 'sequential'.

  • output_filepath: A string with the full output filepath where the results will be written. This should end in .csv.

Returns: A pandas DataFrame containing a summary of the datasets. Each row represents a dataset and each column represents an attribute of the data. If an output_filepath is provided, the same summary is also saved to that file.

dataset_summary = explorer.summarize_datasets(
  modality='single_table',
  output_filepath='datasets_summary.csv'
)
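To summarize every modality in one pass, the call above can be repeated with a separate output file for each one. This is a sketch under assumptions: summary_filepath and summarize_all are hypothetical helpers, and the file naming scheme is only an illustration. The modality values and the summarize_datasets parameters are the ones documented above.

```python
VALID_MODALITIES = ('single_table', 'multi_table', 'sequential')


def summary_filepath(modality, prefix='datasets_summary'):
    """Build a .csv output path for one modality (hypothetical naming scheme)."""
    if modality not in VALID_MODALITIES:
        raise ValueError(f'modality must be one of {VALID_MODALITIES}, got {modality!r}')
    return f'{prefix}_{modality}.csv'


def summarize_all(explorer):
    """Summarize each modality and return a {modality: summary} dictionary."""
    return {
        modality: explorer.summarize_datasets(
            modality=modality,
            output_filepath=summary_filepath(modality),
        )
        for modality in VALID_MODALITIES
    }
```

With an explorer created as shown earlier, summarize_all(explorer) would write three CSV files and return the three summaries keyed by modality.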

Summary Characteristics

Each row of the summary corresponds to a dataset. The dataset is identified by the first column, Dataset.

The dataset summary captures many characteristics of each dataset, from the overall size to the individual columns to the connections between the tables. Browse below for a comprehensive list of summary characteristics that are present as additional columns.

These characteristics capture the overall dataset size.

  • Datasize_Size_MB: The overall size of the dataset in MB

  • Num_Tables: The total number of tables in the dataset

  • Total_Num_Columns: The total number of columns (sum of all the column counts across all tables)

  • Total_Num_Rows: The total number of rows in the schema (sum of row counts across all tables)

  • Max_Num_Columns_Per_Table: The maximum number of columns that a table has in the dataset

  • Max_Num_Rows_Per_Table: The maximum number of rows that any one table has in the dataset
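The size characteristics above can be used to shortlist datasets for a benchmark. Below is a minimal sketch that reads a saved summary with Python's csv module. The select_datasets helper is hypothetical, and the two sample rows are made-up placeholders rather than real SDGym output; only the column names are taken from the list above.

```python
import csv
import io

# Placeholder rows for illustration only; these are not real SDGym results.
SAMPLE_CSV = (
    'Dataset,Datasize_Size_MB,Num_Tables,Total_Num_Columns,Total_Num_Rows,'
    'Max_Num_Columns_Per_Table,Max_Num_Rows_Per_Table\n'
    'example_small,0.5,1,10,1000,10,1000\n'
    'example_large,250.0,3,40,900000,20,500000\n'
)


def select_datasets(csv_file, max_size_mb, max_tables=1):
    """Return the names of datasets within the given size and table limits."""
    reader = csv.DictReader(csv_file)
    return [
        row['Dataset']
        for row in reader
        if float(row['Datasize_Size_MB']) <= max_size_mb
        # int(float(...)) tolerates counts that were written as '1.0'.
        and int(float(row['Num_Tables'])) <= max_tables
    ]


# In practice, pass an open file such as open('datasets_summary.csv', newline='').
small = select_datasets(io.StringIO(SAMPLE_CSV), max_size_mb=10)
```

The same filter could be written directly against the returned pandas DataFrame; the csv version is shown only because it works on the saved file without extra dependencies.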

FAQ

SDGym used to have a function for getting available datasets. Is it still available?

The get_available_datasets function is a legacy function that was used in the past to perform a more basic exploration of the datasets (it only returned the dataset size and number of tables).

As of SDGym v0.11.0, you can continue to use this function, but it may be deprecated in a future version. We recommend switching to the DatasetExplorer instead.

from sdgym import get_available_datasets
get_available_datasets(modality='single_table')
dataset_name        size_MB        num_tables
KRK_v1              0.072128       1
adult               3.907448       1
alarm               4.520128       1
asia                1.280128       1
...
