Explore Datasets
Understand your data. Understanding the characteristics of the datasets is important for benchmarking. Use SDGym to summarize the datasets by key characteristics such as the size of the data, the number of tables, the number of each kind of column, and more.
Dataset Explorer
Create a DatasetExplorer object to begin exploring your data. This object is set up to read from an Amazon S3 bucket that contains data.
Parameters:
s3_url: A string with the path to an S3 bucket containing the data. Provide this only if you have your own datasets in an S3 bucket. Make sure the URL starts with s3://. By default, this is set to the public demo dataset bucket published by DataCebo.
aws_access_key_id: A string containing the AWS access key id. Provide this if your s3_url is for a private S3 bucket.
aws_secret_access_key: A string containing the AWS secret access key. Provide this if your s3_url is for a private S3 bucket.
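The parameter rules above can be checked before the explorer is constructed. The following is a minimal sketch, not part of SDGym: build_explorer_kwargs is a hypothetical helper, and treating the two AWS credentials as an all-or-nothing pair is an assumption based on the descriptions above rather than documented behavior.

```python
from typing import Optional


def build_explorer_kwargs(
    s3_url: Optional[str] = None,
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
) -> dict:
    """Hypothetical helper: collect keyword arguments for DatasetExplorer."""
    kwargs = {}
    if s3_url is not None:
        # The documentation asks for URLs that start with s3://
        if not s3_url.startswith("s3://"):
            raise ValueError(f"s3_url must start with 's3://', got {s3_url!r}")
        kwargs["s3_url"] = s3_url
    # Assumption: the two credentials are only useful together
    if (aws_access_key_id is None) != (aws_secret_access_key is None):
        raise ValueError("Provide both AWS credentials or neither.")
    if aws_access_key_id is not None:
        kwargs["aws_access_key_id"] = aws_access_key_id
        kwargs["aws_secret_access_key"] = aws_secret_access_key
    return kwargs


# Usage (the bucket name is a placeholder):
# from sdgym import DatasetExplorer
# explorer = DatasetExplorer(**build_explorer_kwargs(s3_url="s3://my-bucket/"))
```

Leaving all three arguments unset yields an empty dictionary, so the explorer falls back to its default public bucket.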
from sdgym import DatasetExplorer
explorer = DatasetExplorer()
Dataset Summary
A dataset summary contains characteristic descriptions for each dataset. We have chosen these characteristics as potentially important attributes that can affect benchmarking.
summarize_datasets
Use this DatasetExplorer function to generate a summary of characteristics for each dataset.
Parameters:
(required) modality: The modality of the datasets to summarize. This should be one of 'single_table', 'multi_table' or 'sequential'.
output_filepath: A string with the full output filepath where the results will be written. This should end in .csv.
Returns: A pandas DataFrame containing a summary of the dataset. Each row represents a dataset and each column represents an attribute about the data. If an output_filepath is provided, the same summary is saved in the file.
dataset_summary = explorer.summarize_datasets(
    modality='single_table',
    output_filepath='datasets_summary.csv'
)
Summary Characteristics
Each row of the summary corresponds to a dataset, which is identified by the first column, Dataset.
The dataset summary captures many characteristics of the dataset, from its overall size to its individual columns to the connections between its tables. Browse below for a comprehensive list of summary characteristics that are present as additional columns.
These characteristics capture the overall dataset size.
Datasize_Size_MB: The overall size of the dataset in MB
Num_Tables: The total number of tables in the dataset
Total_Num_Columns: The total number of columns (sum of the column counts across all tables)
Total_Num_Rows: The total number of rows in the schema (sum of the row counts across all tables)
Max_Num_Columns_Per_Table: The maximum number of columns that any one table has in the dataset
Max_Num_Rows_Per_Table: The maximum number of rows that any one table has in the dataset
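To make the definitions above concrete, here is a minimal sketch that derives the count-based size characteristics from table shapes. It is an illustration of the definitions, not SDGym's implementation: size_characteristics is a hypothetical function, the table names and shapes are made up, and Datasize_Size_MB is left out because it depends on the stored data itself.

```python
def size_characteristics(table_shapes):
    """Compute size characteristics from {table_name: (num_rows, num_columns)}."""
    rows = [num_rows for num_rows, _ in table_shapes.values()]
    cols = [num_cols for _, num_cols in table_shapes.values()]
    return {
        "Num_Tables": len(table_shapes),
        "Total_Num_Columns": sum(cols),       # summed across all tables
        "Total_Num_Rows": sum(rows),          # summed across all tables
        "Max_Num_Columns_Per_Table": max(cols),
        "Max_Num_Rows_Per_Table": max(rows),
    }


# Made-up shapes for a three-table dataset
shapes = {"users": (100, 5), "sessions": (450, 8), "transactions": (2000, 6)}
size_summary = size_characteristics(shapes)
print(size_summary)
```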
These characteristics quantify the types of columns inside each dataset. Each one is a column count summed across all tables of the dataset.
Total_Num_Columns_Categorical: The total number of columns that are listed as categorical in the metadata.
Total_Num_Columns_Numerical: The total number of columns that are listed as numerical in the metadata.
Total_Num_Columns_Datetime: The total number of columns that are listed as sdtype datetime in the metadata.
Total_Num_Columns_PII: The total number of columns that are PII in the metadata. Note that by default, PII is assumed to be true for all columns that are not numerical, categorical, boolean, datetime, or id.
Total_Num_Columns_ID_NonKey: The total number of columns that are listed as id in the metadata and are not any type of key (primary key, foreign key, etc.).
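The counting rules above can be sketched in a few lines. This is an illustration under stated assumptions, not SDGym's code: count_column_types is a hypothetical function, the metadata dictionary is a simplified, made-up structure (its key names are assumptions chosen for this example), and only primary and foreign keys are treated as keys.

```python
# sdtypes that are not treated as PII by default, per the rule above
NON_PII_SDTYPES = {"numerical", "categorical", "boolean", "datetime", "id"}


def count_column_types(metadata):
    """Count columns by type across all tables of a simplified metadata dict."""
    # Collect (table, column) pairs that act as primary or foreign keys
    key_columns = set()
    for table_name, table in metadata["tables"].items():
        if table.get("primary_key"):
            key_columns.add((table_name, table["primary_key"]))
    for rel in metadata.get("relationships", []):
        key_columns.add((rel["child_table_name"], rel["child_foreign_key"]))

    counts = {
        "Total_Num_Columns_Categorical": 0,
        "Total_Num_Columns_Numerical": 0,
        "Total_Num_Columns_Datetime": 0,
        "Total_Num_Columns_PII": 0,
        "Total_Num_Columns_ID_NonKey": 0,
    }
    for table_name, table in metadata["tables"].items():
        for column_name, spec in table["columns"].items():
            sdtype = spec["sdtype"]
            if sdtype == "categorical":
                counts["Total_Num_Columns_Categorical"] += 1
            elif sdtype == "numerical":
                counts["Total_Num_Columns_Numerical"] += 1
            elif sdtype == "datetime":
                counts["Total_Num_Columns_Datetime"] += 1
            elif sdtype == "id":
                if (table_name, column_name) not in key_columns:
                    counts["Total_Num_Columns_ID_NonKey"] += 1
            elif sdtype not in NON_PII_SDTYPES:
                counts["Total_Num_Columns_PII"] += 1
    return counts


# Made-up two-table metadata for illustration
metadata = {
    "tables": {
        "users": {
            "primary_key": "user_id",
            "columns": {
                "user_id": {"sdtype": "id"},
                "age": {"sdtype": "numerical"},
                "country": {"sdtype": "categorical"},
                "email": {"sdtype": "email"},
                "joined": {"sdtype": "datetime"},
            },
        },
        "sessions": {
            "primary_key": "session_id",
            "columns": {
                "session_id": {"sdtype": "id"},
                "user_id": {"sdtype": "id"},
                "device_id": {"sdtype": "id"},
                "start": {"sdtype": "datetime"},
                "duration": {"sdtype": "numerical"},
            },
        },
    },
    "relationships": [
        {
            "parent_table_name": "users",
            "parent_primary_key": "user_id",
            "child_table_name": "sessions",
            "child_foreign_key": "user_id",
        }
    ],
}
type_counts = count_column_types(metadata)
print(type_counts)
```

In this made-up example, email is the only column counted as PII, and device_id is the only id column that is neither a primary nor a foreign key.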
These characteristics capture the overall complexity of the dataset schema. This is particularly useful for multi-table datasets.
Num_Relationships: The total number of relationships that are listed in the metadata
Max_Schema_Depth: The maximum depth of the schema. Note that max schema depth is the maximum number of tables in a chain. So a single table is considered to be depth=1, and a parent-child pair of tables is considered depth=2.
Max_Schema_Branch: The maximum number of foreign keys that point to a given table. So if a parent table has 3 child tables that are referring to it, its branching factor is 3.
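The depth and branching definitions above are easiest to see on a small example. The sketch below is an illustration, not SDGym's implementation: schema_complexity is a hypothetical function, relationships are represented as simple (parent, child) pairs with one pair per foreign key, and the schema is assumed to contain no cycles.

```python
from collections import defaultdict


def schema_complexity(table_names, relationships):
    """Compute schema complexity from (parent, child) pairs, one per foreign key."""
    children = defaultdict(list)
    for parent, child in relationships:
        children[parent].append(child)

    depth_cache = {}

    def depth(table):
        # Number of tables in the longest chain that starts at this table
        if table not in depth_cache:
            depth_cache[table] = 1 + max(
                (depth(child) for child in children[table]), default=0
            )
        return depth_cache[table]

    return {
        "Num_Relationships": len(relationships),
        "Max_Schema_Depth": max(depth(t) for t in table_names),
        # Foreign keys pointing at a table = relationships where it is the parent
        "Max_Schema_Branch": max(len(children[t]) for t in table_names),
    }


# Made-up schema: users -> sessions -> transactions, and users -> devices
tables = ["users", "sessions", "transactions", "devices"]
links = [("users", "sessions"), ("sessions", "transactions"), ("users", "devices")]
complexity = schema_complexity(tables, links)
print(complexity)
```

Here the longest chain is users, sessions, transactions, which gives a depth of 3, and users has two foreign keys pointing to it, which gives a branching factor of 2. A single table with no relationships yields a depth of 1 and a branching factor of 0.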