# Explore Datasets

**Understand your data.** Understanding the characteristics of your datasets is important for benchmarking. Use SDGym to summarize datasets by key characteristics such as the size of the data, the number of tables, the number of columns of each type, and more.

## Dataset Explorer

Create a `DatasetExplorer` object to begin exploring your data. This object is set up to read from an Amazon S3 bucket that contains the data.

**Parameters**:

* `s3_url`: A string with the path to an S3 bucket containing the data. Provide this only if you have your own datasets in an S3 bucket. Make sure the URL starts with `s3://`.
  * By default, this is set to the public, demo dataset bucket published by DataCebo.
* `aws_access_key_id`: A string containing the AWS access key id. Provide this if your `s3_url` is for a private S3 bucket.
* `aws_secret_access_key`: A string containing the AWS secret access key. Provide this if your `s3_url` is for a private S3 bucket.

```python
from sdgym import DatasetExplorer

explorer = DatasetExplorer()
```
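For a private bucket, the three parameters above are passed together. The sketch below is a minimal illustration, not part of SDGym: `build_explorer_kwargs` is a hypothetical helper that checks the `s3://` prefix and collects the arguments before they are handed to `DatasetExplorer`. The bucket name and credentials are placeholders, and treating the two AWS keys as an all-or-nothing pair is an assumption of this sketch.

```python
def build_explorer_kwargs(s3_url=None, aws_access_key_id=None,
                          aws_secret_access_key=None):
    """Hypothetical helper (not part of SDGym): validate and collect
    the DatasetExplorer parameters described above."""
    kwargs = {}
    if s3_url is not None:
        # The documentation requires the URL to start with 's3://'.
        if not s3_url.startswith('s3://'):
            raise ValueError(f"s3_url must start with 's3://', got {s3_url!r}")
        kwargs['s3_url'] = s3_url

    # Assumption of this sketch: the two credentials are supplied together.
    if (aws_access_key_id is None) != (aws_secret_access_key is None):
        raise ValueError('Provide both AWS credentials or neither.')
    if aws_access_key_id is not None:
        kwargs['aws_access_key_id'] = aws_access_key_id
        kwargs['aws_secret_access_key'] = aws_secret_access_key
    return kwargs


# Placeholder values; replace them with your own bucket and credentials.
kwargs = build_explorer_kwargs(
    s3_url='s3://my-private-bucket/datasets',
    aws_access_key_id='MY_ACCESS_KEY_ID',
    aws_secret_access_key='MY_SECRET_ACCESS_KEY',
)

# With SDGym installed, the arguments would then be passed along:
# from sdgym import DatasetExplorer
# explorer = DatasetExplorer(**kwargs)
```

Calling the helper with no arguments returns an empty dictionary, which corresponds to using the default public demo bucket.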

## Dataset Summary

A dataset summary contains characteristic descriptions for each dataset. We have chosen these characteristics as potentially important attributes that can affect benchmarking.

### summarize\_datasets

Use this `DatasetExplorer` method to generate a summary of characteristics for each dataset.

**Parameters**:

* (required) `modality`: The modality of the datasets to summarize. This should be one of `'single_table'`, `'multi_table'` or `'sequential'`.
* `output_filepath`: A string with the full output filepath where the results will be written. This should end in `.csv`.

**Returns**: A pandas DataFrame containing a summary of the datasets. Each row represents a dataset and each column represents an attribute of the data. If an `output_filepath` is provided, the same summary is also saved to that file.

```python
dataset_summary = explorer.summarize_datasets(
  modality='single_table',
  output_filepath='datasets_summary.csv'
)
```
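Once the summary has been written to a CSV file, it can be filtered to pick datasets for a benchmark. The sketch below uses only the standard library and a tiny in-memory stand-in for `datasets_summary.csv`. The column names are taken from the Summary Characteristics list below; the dataset names and values are made-up placeholders, and `smallest_datasets` is a hypothetical helper, not an SDGym function.

```python
import csv
import io

# Placeholder stand-in for the contents of 'datasets_summary.csv'.
# The column names follow the summary characteristics; the rows are made up.
sample_csv = io.StringIO(
    'Dataset,Num_Tables,Total_Num_Columns,Total_Num_Rows\n'
    'dataset_a,1,15,1000\n'
    'dataset_b,1,37,250\n'
    'dataset_c,1,8,50000\n'
)


def smallest_datasets(csv_file, max_rows):
    """Return the names of datasets with at most `max_rows` total rows,
    ordered from smallest to largest."""
    rows = list(csv.DictReader(csv_file))
    # int(float(...)) tolerates values written as '1000' or '1000.0'.
    selected = [r for r in rows if int(float(r['Total_Num_Rows'])) <= max_rows]
    selected.sort(key=lambda r: int(float(r['Total_Num_Rows'])))
    return [r['Dataset'] for r in selected]


small = smallest_datasets(sample_csv, max_rows=1000)
print(small)  # ['dataset_b', 'dataset_a']
```

To run this against a real summary, pass an open file handle for the saved CSV instead of `sample_csv`.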

### Summary Characteristics

Each row of the summary corresponds to a dataset, identified by the first column, `Dataset`.

The dataset summary captures many characteristics of the dataset, from the overall size, to the individual columns, to the connections between the tables. Browse below for a comprehensive list of the summary characteristics that are present as additional columns.

{% tabs %}
{% tab title="Overall Size" %}
These characteristics capture the overall dataset size.

* `Datasize_Size_MB`: The overall size of the dataset in MB
* `Num_Tables`: The total number of tables in the dataset
* `Total_Num_Columns`: The total number of columns (sum of all the column counts across all tables)
* `Total_Num_Rows`: The total number of rows in the schema (sum of row counts across all tables)
* `Max_Num_Columns_Per_Table`: The maximum number of columns that a table has in the dataset
* `Max_Num_Rows_Per_Table`: The maximum number of rows that a table has in the dataset
  {% endtab %}

{% tab title="Column Types" %}
These characteristics quantify the types of columns inside each dataset. All of these are the sum of the total column counts across all tables of the dataset.

* `Total_Num_Columns_Categorical`: The total number of columns that are listed as `categorical` in the metadata.
* `Total_Num_Columns_Numerical`: The total number of columns that are listed as `numerical` in the metadata.
* `Total_Num_Columns_Datetime`: The total number of columns that are listed as `datetime` in the metadata.
* `Total_Num_Columns_PII`: The total number of columns that are PII in the metadata. *Note that by default, PII is assumed to be true for all columns that are not numerical, categorical, boolean, datetime, or id.*
* `Total_Num_Columns_ID_NonKey`: The total number of columns that are listed as `id` in the metadata *and* are not any type of key (primary key, foreign key, etc.).
  {% endtab %}

{% tab title="Schema Complexity" %}
These characteristics capture the overall complexity of the dataset schema. This is particularly useful for multi-table datasets.

* `Num_Relationships`: The total number of relationships that are listed in the metadata
* `Max_Schema_Depth`: The maximum depth of the schema. *Note that max schema depth is the maximum number of tables in a chain. So a single table is considered to be depth=1, and a parent-child pair of tables is considered depth=2.*
* `Max_Schema_Branch`: The maximum number of foreign keys that point to a given table. *So if a parent table has 3 child tables that are referring to it, its branching factor is 3.*
  {% endtab %}
  {% endtabs %}
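To make the schema complexity definitions concrete, the sketch below computes a depth and a branching factor for a small, made-up schema. It is an illustration of the definitions above, not SDGym's implementation: the table names, the `(parent, child)` tuple representation, and both function names are assumptions of this example, and the schema is assumed to contain no cycles.

```python
from collections import Counter


def max_schema_branch(relationships):
    """Largest number of foreign keys pointing to any one parent table."""
    counts = Counter(parent for parent, _child in relationships)
    # Returning 0 when there are no relationships is a choice of this sketch.
    return max(counts.values(), default=0)


def max_schema_depth(tables, relationships):
    """Largest number of tables in any parent-to-child chain (acyclic schema)."""
    children = {table: [] for table in tables}
    for parent, child in relationships:
        children[parent].append(child)

    memo = {}

    def depth(table):
        # A table with no children counts as a chain of length 1.
        if table not in memo:
            memo[table] = 1 + max((depth(c) for c in children[table]), default=0)
        return memo[table]

    return max((depth(table) for table in tables), default=0)


# Hypothetical schema: users -> sessions -> transactions, and users -> reviews.
tables = ['users', 'sessions', 'transactions', 'reviews']
relationships = [
    ('users', 'sessions'),
    ('sessions', 'transactions'),
    ('users', 'reviews'),
]

print(max_schema_depth(tables, relationships))  # 3
print(max_schema_branch(relationships))         # 2
```

Here the longest chain contains three tables, and `users` is referenced by two child tables, so its branching factor is 2.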

## FAQ

<details>

<summary>SDGym used to have a function for getting available datasets. Is it still available?</summary>

The `get_available_datasets` function is a legacy function that was used in the past to perform a more basic exploration of the datasets (it only returned the dataset size and number of tables).

As of SDGym v0.11.0, you can continue to use this function, but it may be deprecated in future versions. We recommend switching to the `DatasetExplorer` instead.

```python
from sdgym import get_available_datasets

# Call the imported function directly (the `sdgym` module name itself is not imported)
get_available_datasets(modality='single_table')
```

```
dataset_name        size_MB        num_tables
KRK_v1              0.072128       1
adult               3.907448       1
alarm               4.520128       1
asia                1.280128       1
...
```

</details>

