# Metadata

This guide describes the SDGym's metadata specification for single and multi-table data.

## What is metadata?

**Metadata** is a basic, factual description of a dataset that includes:

* The type of data that each column represents
* The primary keys and other identifiers of the table

{% hint style="info" %}
**How does the SDGym library use metadata?** Many of the synthesizers use information in the metadata to create higher quality synthetic data. For example, the SDV Synthesizers apply different logic to different column types.

Additionally, the evaluation framework factors in the metadata when applying metrics. For example, some metrics may only be applicable for specific column types.
{% endhint %}

The SDGym library expects that every dataset will have corresponding metadata provided as a JSON file. During benchmarking, the SDGym reads the file as a **Python dictionary**.
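
As a rough illustration of what reading the file as a dictionary amounts to, the sketch below writes a small metadata file and loads it back with Python's standard `json` module. The file name `metadata.json` and the temporary folder are assumptions made for this example; SDGym's own loading code may differ.

```python
import json
import tempfile
from pathlib import Path

# A small metadata document in the format shown in this guide.
metadata_text = """
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "users": {
            "primary_key": "user_id",
            "columns": {
                "user_id": {"sdtype": "id"},
                "age": {"sdtype": "numerical"}
            }
        }
    }
}
"""

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file name, used only for this illustration.
    path = Path(folder) / "metadata.json"
    path.write_text(metadata_text, encoding="utf-8")

    # json.load turns the JSON object into a plain Python dictionary.
    with path.open(encoding="utf-8") as f:
        metadata = json.load(f)

print(type(metadata).__name__)
print(list(metadata["tables"]["users"]["columns"]))
```

Once loaded, the metadata can be navigated like any nested dictionary, for example `metadata["tables"]["users"]["primary_key"]`.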

### Example

We assume that the data is present in a CSV format that describes rows and columns of a single table. For multi-table data, there are multiple tables that contain primary/foreign key connections.

<figure><img src="https://3464836953-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FLLx9fQwQGgVNyQnbyMBb%2Fuploads%2FHNKsahoBjlaPvbAYa7Cg%2Fsdgym-synthetic-data-gym-resources-metadata_Aug%2004%202025.png?alt=media&#x26;token=af67f7ab-cb52-4436-9269-d50e40691e1c" alt=""><figcaption><p>This example of a single table includes a new row for each user. The row includes their personal information.</p></figcaption></figure>

<details>

<summary>Click to see the table's metadata</summary>

This is the metadata dictionary for the illustrated table.

```json
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "users": {
            "primary_key": "user_id",
            "columns": {
                "user_id": { "sdtype": "id", "regex_format": "U_[0-9]{3}" },
                "age": { "sdtype": "numerical" },
                "address": { "sdtype": "street_address" },
                "tier": { "sdtype": "categorical" },
                "active": { "sdtype": "boolean" },
                "paid_amt": { "sdtype": "numerical" },
                "renew_date": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" }
            }
        }
    }
}
```

</details>

## Overview <a href="#metadata-specification" id="metadata-specification"></a>

The metadata contains the following top-level elements:

* (required) `"METADATA_SPEC_VERSION"`: The version of the metadata specification. For the format described in this guide, the value is `"V1"`, indicating that the metadata uses the multi-table format that is compatible with SDV version 1.
* (required) `"tables"`: A dictionary that maps the table names to the table-specific metadata such as primary keys, column names and data types. Note that SDGym only works with single-table schemas.
* `"relationships"`: A list of dictionaries that specify the connections between the tables for multi-table data
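
The top-level elements listed above can be checked with a few lines of Python. This is only a sketch: `check_top_level` is a hypothetical helper written for this guide, not a function provided by SDGym or SDV.

```python
def check_top_level(metadata):
    """Return a list of problems found in the top-level metadata keys."""
    problems = []

    # Both of these keys are marked as required in the specification.
    for key in ("METADATA_SPEC_VERSION", "tables"):
        if key not in metadata:
            problems.append(f"missing required key: {key}")

    # "tables" maps table names to table-specific metadata.
    if "tables" in metadata and not isinstance(metadata["tables"], dict):
        problems.append('"tables" must be a dictionary')

    # "relationships" is optional, but when present it is a list.
    if not isinstance(metadata.get("relationships", []), list):
        problems.append('"relationships" must be a list')

    return problems


good = {"METADATA_SPEC_VERSION": "V1", "tables": {"users": {"columns": {}}}}
bad = {"tables": []}

print(check_top_level(good))
print(check_top_level(bad))
```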

## Tables

The tables dictionary maps each table name to the table-specific metadata. If you have a single table, the table name does not matter, but please be sure that the table-specific metadata matches your data. For multi-table data, the table name matters because it is needed to identify each table.

* (required) `"columns"`: A dictionary that maps the column names to the data types they represent and any other attributes.
* `"primary_key"`: The column name that is the primary key in the table
* `"alternate_keys"`: A list of column names that can act as alternate keys in the table
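
One consistency check that follows from this structure is that every key column should also be described under `"columns"`. The sketch below assumes that rule for illustration; `check_table` is a hypothetical helper, not part of SDGym.

```python
def check_table(table_name, table_metadata):
    """Return a list of problems found in one table's metadata."""
    columns = table_metadata.get("columns")
    if not isinstance(columns, dict):
        return [f"{table_name}: missing required 'columns' dictionary"]

    # Collect the primary key (if any) and all alternate keys.
    key_columns = []
    if "primary_key" in table_metadata:
        key_columns.append(table_metadata["primary_key"])
    key_columns.extend(table_metadata.get("alternate_keys", []))

    # Assumption for this sketch: each key must be a described column.
    return [
        f"{table_name}: key column '{key}' is not listed in 'columns'"
        for key in key_columns
        if key not in columns
    ]


users = {
    "primary_key": "user_id",
    "alternate_keys": ["email"],
    "columns": {
        "user_id": {"sdtype": "id"},
        "age": {"sdtype": "numerical"},
    },
}

print(check_table("users", users))
```

In this example the alternate key `email` is reported because it has no entry under `"columns"`.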

### Table Columns

When describing a column, you will provide the column name and the data type, known as the **sdtype**.

The 5 common sdtypes are: `"numerical"`, `"datetime"`, `"categorical"`, `"boolean"` and `"id"`. Click on a type below to learn more about it and how to specify it in the metadata.

{% tabs %}
{% tab title="boolean" %}
Boolean columns represent True or False values.

```json
"active" : {
    "sdtype": "boolean"
}
```

**Properties** (None)
{% endtab %}

{% tab title="categorical" %}
Categorical columns represent discrete data.

```json
"tier" : {
    "sdtype": "categorical"
}
```

**Properties** (None)
{% endtab %}

{% tab title="datetime" %}
Datetime columns represent a point in time.

```json
"renew_date": {
    "sdtype": "datetime", 
    "datetime_format": "%Y-%m-%d"
}
```

**Properties**

* (required) `datetime_format`: A string describing the format, using [Python's strftime format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).
  {% endtab %}

{% tab title="numerical" %}
Numerical columns represent discrete or continuous numerical values.

```json
"age": {
    "sdtype": "numerical",
    "computer_representation": "Int64"
},
"paid_amt": {
    "sdtype": "numerical",
    "computer_representation": "Float"
}
```

**Properties**

* `computer_representation`: A string that represents how you'll ultimately store the data. This determines the min and max values allowed\
  Available options are: `'Float'`, `'Int8'`, `'Int16'`, `'Int32'`, `'Int64'`, `'UInt8'`, `'UInt16'`, `'UInt32'`, `'UInt64'`

Use `"sdtype": "numerical"` to specify columns that represent whole number or continuous values.
{% endtab %}

{% tab title="id" %}
ID columns represent identifiers that do not have any special mathematical or semantic meaning.

```json
"user_id": { 
    "sdtype": "id",
    "regex_format": "U_[0-9]{3}"
}
```

**Properties**

* `regex_format`: A string describing the format of the ID as a [regular expression](https://docs.python.org/3/library/re.html)
  {% endtab %}

{% tab title="other" %}
You can input any other data type such as `'phone_number'`, `'ssn'` or `'email'`. See the Sdtypes Reference for a full list.

```json
"address": {
    "sdtype": "address",
    "pii": true
}
```

**Properties**

* `pii`: A boolean denoting whether the data is sensitive
  * (default) `true`: The column is sensitive, meaning the values should be anonymized
  * `false`: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data
    {% endtab %}
    {% endtabs %}
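
To make the `datetime_format` and `regex_format` properties concrete, the sketch below checks individual string values against a column's metadata using Python's standard `datetime` and `re` modules. `value_matches` is a hypothetical helper, and requiring the regular expression to match the whole value is an assumption of this example rather than documented SDGym behavior.

```python
import re
from datetime import datetime


def value_matches(value, column_metadata):
    """Check one string value against a column's metadata description."""
    sdtype = column_metadata.get("sdtype")

    if sdtype == "datetime":
        # strptime raises ValueError when the value does not fit the format.
        try:
            datetime.strptime(value, column_metadata["datetime_format"])
        except ValueError:
            return False
        return True

    if sdtype == "id" and "regex_format" in column_metadata:
        # Assumption: the pattern must match the entire value.
        return re.fullmatch(column_metadata["regex_format"], value) is not None

    # Other sdtypes are not checked in this sketch.
    return True


renew_date = {"sdtype": "datetime", "datetime_format": "%Y-%m-%d"}
user_id = {"sdtype": "id", "regex_format": "U_[0-9]{3}"}

print(value_matches("2024-03-15", renew_date))
print(value_matches("15/03/2024", renew_date))
print(value_matches("U_042", user_id))
print(value_matches("U_42", user_id))
```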

## Relationships <a href="#relationships" id="relationships"></a>

The `"relationships"` element is a list of dictionary objects. Each one describes the relationship between two connected tables: a parent and a child. The parent table contains the primary key being referenced, while the child table has rows that refer to its parent. Multiple child rows can refer to the same parent row.

* `"parent_table_name"`: The name of the parent table
* `"parent_primary_key"`: The primary key column in the parent table. This column uniquely identifies each row in the parent table.
* `"child_table_name"`: The name of the child table that refers to the parent
* `"child_foreign_key"`: The foreign key column in the child table. The values in this column contain a reference to a row in the parent table

Use a new dictionary for each relationship.

```json
"relationships": [{
    "parent_table_name": "users",
    "parent_primary_key": "user_id",
    "child_table_name": "sessions",
    "child_foreign_key": "user_id"
}, {
    "parent_table_name": "sessions",
    "parent_primary_key": "session_id",
    "child_table_name": "transaction",
    "child_foreign_key": "transacted_session_id"
}]
```
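
Each relationship should line up with the `"tables"` section of the same metadata. The sketch below checks that both tables exist, that the parent key is the parent's declared primary key, and that the foreign key is a described column of the child. `check_relationships` is a hypothetical helper written for this guide, and the `device` column is made up for the example.

```python
def check_relationships(metadata):
    """Return a list of relationships that disagree with the tables section."""
    problems = []
    tables = metadata.get("tables", {})

    for rel in metadata.get("relationships", []):
        parent = tables.get(rel["parent_table_name"])
        child = tables.get(rel["child_table_name"])
        label = f'{rel["parent_table_name"]} -> {rel["child_table_name"]}'

        if parent is None or child is None:
            problems.append(f"{label}: unknown table")
            continue
        # The parent key should be the parent's declared primary key.
        if parent.get("primary_key") != rel["parent_primary_key"]:
            problems.append(f"{label}: parent key is not the primary key")
        # The foreign key should be a described column of the child table.
        if rel["child_foreign_key"] not in child.get("columns", {}):
            problems.append(f"{label}: foreign key column not found")

    return problems


metadata = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "users": {"primary_key": "user_id", "columns": {"user_id": {"sdtype": "id"}}},
        "sessions": {
            "primary_key": "session_id",
            "columns": {
                "session_id": {"sdtype": "id"},
                "user_id": {"sdtype": "id"},
                "device": {"sdtype": "categorical"},
            },
        },
    },
    "relationships": [
        {"parent_table_name": "users", "parent_primary_key": "user_id",
         "child_table_name": "sessions", "child_foreign_key": "user_id"},
        {"parent_table_name": "sessions", "parent_primary_key": "session_id",
         "child_table_name": "transaction", "child_foreign_key": "transacted_session_id"},
    ],
}

print(check_relationships(metadata))
```

Here the second relationship is reported because the `transaction` table is not described under `"tables"`.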

## FAQs <a href="#saving-and-loading-metadata" id="saving-and-loading-metadata"></a>

<details>

<summary>Should all the datasets include metadata?</summary>

Yes, every dataset available in the SDGym's demo module has an associated metadata file. If you are supplying custom datasets, make sure to write and attach a metadata file too. See [Datasets](https://docs.sdv.dev/sdgym/customization/datasets) for more information.

</details>

<details>

<summary>Can my custom synthesizer make use of the metadata?</summary>

Yes, your custom synthesizer can use any information in the metadata to help create higher quality synthetic data. The metadata information is passed into your synthesizer as a Python dictionary during the training process. See [Custom Synthesizers](https://docs.sdv.dev/sdgym/customization/synthesizers/custom-synthesizers) for more information.
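
As one possible starting point, a synthesizer could group column names by sdtype and then handle each group differently. The sketch below only reads the metadata dictionary; `columns_by_sdtype` is a hypothetical helper and does not reflect how SDGym or any particular synthesizer is implemented.

```python
from collections import defaultdict


def columns_by_sdtype(metadata, table_name):
    """Group the column names of one table by their sdtype."""
    groups = defaultdict(list)
    columns = metadata["tables"][table_name]["columns"]
    for column_name, column_metadata in columns.items():
        groups[column_metadata["sdtype"]].append(column_name)
    return dict(groups)


metadata = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "users": {
            "primary_key": "user_id",
            "columns": {
                "user_id": {"sdtype": "id", "regex_format": "U_[0-9]{3}"},
                "age": {"sdtype": "numerical"},
                "tier": {"sdtype": "categorical"},
                "paid_amt": {"sdtype": "numerical"},
            },
        }
    },
}

groups = columns_by_sdtype(metadata, "users")
print(groups)
```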

</details>
