# Metadata JSON

This guide describes the metadata JSON spec.

<figure><img src="https://1967107441-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FfNxEeZzl9uFiJ4Zf4BRZ%2Fuploads%2FpPLDP2gOJBQvF5yF6Zmz%2Fmulti-table-data-data-preparation_May%208%202025.png?alt=media&#x26;token=f23c6cac-43db-4e08-9eaf-5c850aacecfc" alt=""><figcaption><p>An example of multi table data. The tables are connected to each other through primary/foreign keys.</p></figcaption></figure>

<details>

<summary>Click to see the metadata JSON file</summary>

This is an example of a JSON file describing a multi table schema.

```json
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "city": { "sdtype": "categorical" },
                "rating": { "sdtype": "numerical" }
            },
            "column_relationships": []
        },
        "guests": {
            "primary_key": "guest_email",
            "columns": {
                "guest_email": { "sdtype": "email" },
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "checkin_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y" },
                "checkout_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y" },
                "room_type": { "sdtype": "categorical" }
            },
            "column_relationships": []
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}
```

</details>

{% hint style="success" %}
**Create your metadata programmatically.** Use the [Python API](https://docs.sdv.dev/sdv/multi-table-data/data-preparation/creating-metadata) to automatically detect the metadata based on your data.
{% endhint %}

## Overview

The metadata for a multi table dataset contains the following elements:

* (required) `"METADATA_SPEC_VERSION"`: The version of the metadata spec. For the format described in this guide, the version is `"V1"`, indicating a multi table dataset that is compatible with SDV version 1.
* (required) `"tables"`: A dictionary that maps the table names to the table-specific metadata, such as primary keys, column names and data types.
* (required) `"relationships"`: A list of dictionaries that specify the connections between the tables.
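
As a rough illustration of the three required elements, the sketch below parses a small metadata document with Python's standard `json` module and reports any required top-level key that is absent. The helper name `missing_top_level_keys` is hypothetical and is not part of the SDV; it only mirrors the list above.

```python
import json

# Top-level keys that the list above marks as required.
REQUIRED_TOP_LEVEL_KEYS = ("METADATA_SPEC_VERSION", "tables", "relationships")


def missing_top_level_keys(metadata: dict) -> list:
    """Return the required top-level keys that are absent (hypothetical helper)."""
    return [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in metadata]


# A minimal document that follows the layout described in this guide.
metadata = json.loads("""
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {"hotel_id": {"sdtype": "id"}}
        }
    },
    "relationships": []
}
""")

print(missing_top_level_keys(metadata))        # nothing is missing
print(missing_top_level_keys({"tables": {}}))  # two keys are missing
```

This check only looks at key presence. It does not validate the contents of each element.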

## Tables

The tables dictionary maps each table name to the table-specific metadata. This includes:

* (required) `"columns"`: A dictionary that maps the column names to the data types they represent and any other attributes.
* `"primary_key"`: The column name that is the primary key in the table. This must be an ID or PII sdtype.\
  ＊ *If the table has a composite key, this will be a list of column names instead; in this case, at least 1 of the columns must be an ID or another PII sdtype. Only SDV Enterprise users can create synthesizers with composite keys.*
* `"alternate_keys"`: A list of column names that can act as alternate keys in the table.\
  ＊ *If the table has a composite key, this will be a list of lists of column names instead; in this case, at least 1 of the columns must be an ID or another PII sdtype. Only SDV Enterprise users can create synthesizers with composite keys.*
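
The rules above can be sketched as a small check on one table entry. This is a minimal illustration, not the SDV's own validation: the helper `check_table` is hypothetical, and it assumes that any sdtype other than the four non-ID common ones counts as an ID or PII sdtype.

```python
# Assumption for this sketch: any sdtype outside these four is treated
# as an ID or PII sdtype.
NON_KEY_SDTYPES = {"numerical", "datetime", "categorical", "boolean"}


def check_table(table_name: str, table: dict) -> list:
    """Return a list of problems found in one table entry (hypothetical helper)."""
    columns = table.get("columns")
    if columns is None:
        return [f"{table_name}: missing required 'columns'"]

    problems = []
    primary_key = table.get("primary_key")
    if primary_key is not None:
        # A composite key is a list of column names; a simple key is one string.
        key_columns = primary_key if isinstance(primary_key, list) else [primary_key]
        unknown = [name for name in key_columns if name not in columns]
        if unknown:
            problems.append(f"{table_name}: primary key columns not found: {unknown}")
        elif all(columns[name]["sdtype"] in NON_KEY_SDTYPES for name in key_columns):
            problems.append(f"{table_name}: primary key needs an ID or PII column")
    return problems


hotel_columns = {
    "hotel_id": {"sdtype": "id", "regex_format": "HID_[0-9]{3}"},
    "city": {"sdtype": "categorical"},
    "rating": {"sdtype": "numerical"},
}

print(check_table("hotels", {"primary_key": "hotel_id", "columns": hotel_columns}))
print(check_table("hotels", {"primary_key": "rating", "columns": hotel_columns}))
```

The same approach could be extended to `"alternate_keys"` by running the key check on each entry of that list.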

### Table Columns

When describing a column, you will provide the column name and the data type, known as the **sdtype**.

The 5 common sdtypes are: `"numerical"`, `"datetime"`, `"categorical"`, `"boolean"` and `"id"`. Click on a type below to learn more about it and how to specify it in the metadata.

{% tabs %}
{% tab title="boolean" %}
Boolean columns represent True or False values.

```json
"is_active" : {
    "sdtype": "boolean"
}
```

**Properties** (None)
{% endtab %}

{% tab title="categorical" %}
Categorical columns represent discrete data. By default, they are unordered (aka nominal data).

```json
"room_type" : {
    "sdtype": "categorical"
}
```

**Properties** (None)
{% endtab %}

{% tab title="datetime" %}
Datetime columns represent a point in time.

```json
"checkin_date": {
    "sdtype": "datetime", 
    "datetime_format": "%d %b %Y"
}
```

**Properties**

* (required) `datetime_format`: A string describing the format, using [Python's strftime format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).
  {% endtab %}

{% tab title="numerical" %}
Numerical columns represent discrete or continuous numerical values.

```json
"rating": {
    "sdtype": "numerical",
    "computer_representation": "Float"
}
```

**Properties**

* `computer_representation`: A string that represents how you'll ultimately store the data. This determines the min and max values allowed.\
  Available options are: `'Float'`, `'Int8'`, `'Int16'`, `'Int32'`, `'Int64'`, `'UInt8'`, `'UInt16'`, `'UInt32'`, `'UInt64'`
  {% endtab %}

{% tab title="id" %}
ID columns represent identifiers that do not have any special mathematical or semantic meaning.

```json
"hotel_id": { 
    "sdtype": "id",
    "regex_format": "HID_[0-9]{3}"
}
```

**Properties**

* `regex_format`: A string describing the format of the ID as a [regular expression](https://docs.python.org/3/library/re.html)
  {% endtab %}

{% tab title="other" %}
You can input any other data type such as `'phone_number'`, `'ssn'` or `'email'`. See the [Sdtypes Reference](https://docs.sdv.dev/sdv/concepts/sdtypes#conceptual-sdtypes) for a full list.

```json
"guest_email": {
    "sdtype": "email",
    "pii": true
}
```

**Properties**

* `pii`: A boolean denoting whether the data is sensitive
  * (default) `true`: The column is sensitive, meaning the values should be anonymized&#x20;
  * `false`: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data
    {% endtab %}
    {% endtabs %}
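
To make the format properties concrete, here is a minimal sketch that tests sample values against a column's `datetime_format` and `regex_format`, and computes the value range implied by an integer `computer_representation`. The helpers `value_matches_column` and `integer_bounds` are hypothetical. Matching the whole string with `re.fullmatch` and using the standard fixed-width integer ranges are assumptions of this sketch, not confirmed SDV behavior.

```python
import re
from datetime import datetime


def value_matches_column(value: str, column: dict) -> bool:
    """Check one string value against a column's format properties (hypothetical helper)."""
    if column["sdtype"] == "datetime":
        try:
            datetime.strptime(value, column["datetime_format"])
        except ValueError:
            return False
        return True
    if column["sdtype"] == "id" and "regex_format" in column:
        # Assumption: the regex must match the entire value.
        return re.fullmatch(column["regex_format"], value) is not None
    # Other sdtypes carry no format property in this sketch.
    return True


def integer_bounds(computer_representation: str):
    """Return (min, max) for the integer options, or None for 'Float'."""
    match = re.fullmatch(r"(U?)Int(8|16|32|64)", computer_representation)
    if match is None:
        return None
    bits = int(match.group(2))
    if match.group(1) == "U":
        return 0, 2**bits - 1
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


checkin_date = {"sdtype": "datetime", "datetime_format": "%d %b %Y"}
hotel_id = {"sdtype": "id", "regex_format": "HID_[0-9]{3}"}

print(value_matches_column("08 May 2025", checkin_date))  # matches the format
print(value_matches_column("2025-05-08", checkin_date))   # does not match
print(value_matches_column("HID_042", hotel_id))          # matches the regex
print(integer_bounds("Int8"))
```

Note that `%b` depends on the active locale. The example assumes English month abbreviations.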

### Column Relationships

Annotate groups of columns that represent higher level concepts. Denote the concept using the `"type"` keyword, followed by `"column_names"` with the list of columns involved. The column names can be listed in any order.

Each relationship type supports different types of columns. Browse the tabs below to explore the different options.

{% tabs %}
{% tab title="address" %}
An address is defined by 2 or more columns that have the following sdtypes: `country_code`, `administrative_unit`, `state`, `state_abbr`, `city`, `postcode`, `street_address` and `secondary_address`.

```json
{
    "type": "address",
    "column_names": ["addr_line1", "addr_line2", "city", "state", "zipcode"]
}
```

{% endtab %}

{% tab title="gps" %}
A GPS coordinate pair is defined by 2 columns:

* one column with sdtype `latitude`
* one column with sdtype `longitude`

```json
{
    "type": "gps",
    "column_names": ["location_lat", "location_lon"]
}
```

{% endtab %}

{% tab title="More coming soon!" %}
Additional column relationships coming soon!

*Do you have a request for a type of column relationship? Please* [*file a feature request*](https://github.com/sdv-dev/SDV/issues/new/choose) *describing your use case.*
{% endtab %}
{% endtabs %}

{% hint style="info" %}
While anyone can add column relationships to their data, SDV Enterprise users will see the highest quality data for the relationships. To learn more about the SDV Enterprise and its extra features, [visit our website](https://datacebo.com/pricing/).
{% endhint %}
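
As an illustration of how a relationship type constrains its columns, the sketch below checks a `gps` entry against the columns of its table. The helper `check_gps_relationship` is hypothetical and only encodes the rule stated in the gps tab: exactly 2 columns, one `latitude` and one `longitude`, in any order.

```python
def check_gps_relationship(relationship: dict, columns: dict) -> list:
    """Return a list of problems with a gps column relationship (hypothetical helper)."""
    problems = []
    names = relationship["column_names"]
    if len(names) != 2:
        problems.append("a gps relationship needs exactly 2 columns")
    # Sorting makes the check independent of the order of the column names.
    sdtypes = sorted(columns[name]["sdtype"] for name in names if name in columns)
    if sdtypes != ["latitude", "longitude"]:
        problems.append("gps columns must be one latitude and one longitude")
    return problems


columns = {
    "location_lat": {"sdtype": "latitude"},
    "location_lon": {"sdtype": "longitude"},
    "city": {"sdtype": "city"},
}

in_order = {"type": "gps", "column_names": ["location_lat", "location_lon"]}
reversed_order = {"type": "gps", "column_names": ["location_lon", "location_lat"]}
wrong_columns = {"type": "gps", "column_names": ["location_lat", "city"]}

print(check_gps_relationship(in_order, columns))
print(check_gps_relationship(reversed_order, columns))
print(check_gps_relationship(wrong_columns, columns))
```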

## Relationships

The relationships list contains dictionary objects, each describing the connection between 2 tables: a parent and a child. The parent table contains the primary key being referenced, while the child table has rows that refer to their parent. Multiple child rows can refer to the same parent row.

* `"parent_table_name"`: The name of the parent table
* `"parent_primary_key"`: The primary key column in the parent table. This column uniquely identifies each row in the parent table. The column must be an ID or another PII sdtype.\
  ＊ *If the table has a composite key, this will be a list of column names instead; in this case, at least 1 of the columns must be an ID or another PII sdtype. Only SDV Enterprise users can create synthesizers with composite keys.*
* `"child_table_name"`: The name of the child table that refers to the parent
* `"child_foreign_key"`: The foreign key column in the child table. The values in this column contain a reference to a row in the parent table\
  ＊ *If the table has a composite key, this will be a list of column names instead. The length of this list should match the parent primary key. Only SDV Enterprise users can create synthesizers with composite keys.*

Use multiple dictionaries to represent multiple relationships.

```json
"relationships": [{
    "parent_table_name": "users",
    "parent_primary_key": "user_id",
    "child_table_name": "sessions",
    "child_foreign_key": "user_id"
}, {
    "parent_table_name": "sessions",
    "parent_primary_key": "session_id",
    "child_table_name": "transaction",
    "child_foreign_key": "transacted_session_id"
}]
```
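
A relationship only makes sense if it agrees with the `"tables"` dictionary. The following sketch cross-checks one relationship against the tables it names. The helper `check_relationship` is hypothetical, handles simple (non-composite) keys only, and uses a trimmed-down version of the `users` and `sessions` tables with assumed column definitions.

```python
def check_relationship(relationship: dict, tables: dict) -> list:
    """Return a list of problems with one relationship (hypothetical helper)."""
    problems = []
    parent = tables.get(relationship["parent_table_name"])
    child = tables.get(relationship["child_table_name"])

    if parent is None:
        problems.append("unknown parent table")
    elif parent.get("primary_key") != relationship["parent_primary_key"]:
        problems.append("parent_primary_key is not the parent's primary key")

    if child is None:
        problems.append("unknown child table")
    elif relationship["child_foreign_key"] not in child["columns"]:
        problems.append("child_foreign_key is not a column of the child table")
    return problems


# Assumed column definitions, for illustration only.
tables = {
    "users": {
        "primary_key": "user_id",
        "columns": {"user_id": {"sdtype": "id"}},
    },
    "sessions": {
        "primary_key": "session_id",
        "columns": {"session_id": {"sdtype": "id"}, "user_id": {"sdtype": "id"}},
    },
}

valid = {
    "parent_table_name": "users",
    "parent_primary_key": "user_id",
    "child_table_name": "sessions",
    "child_foreign_key": "user_id",
}
broken = dict(valid, child_foreign_key="owner_id")

print(check_relationship(valid, tables))
print(check_relationship(broken, tables))
```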

## Multi Sequence Data

In some cases, you may have a table that describes multiple, ordered sequences of data such as the one shown below.

<figure><img src="https://1967107441-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FfNxEeZzl9uFiJ4Zf4BRZ%2Fuploads%2FzPsbsXbMULfrTaOTilK3%2Fsequential-data-data-preparation_May%208%202025.png?alt=media&#x26;token=478612f6-1474-4ae8-99fa-dfec4304b96d" alt=""><figcaption><p>An example of sequential data. There are multiple sequences (one for each Patient ID). Within each sequence is an ordered set of rows.</p></figcaption></figure>

You can annotate the sequences within a table by applying the `sequence_key` and `sequence_index` keywords.

<details>

<summary>Click to see the metadata JSON</summary>

```json
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "patients": {
            "sequence_key": "Patient ID",
            "sequence_index": "Time",
            "columns": {
                "Patient ID": { "sdtype": "id", "regex_format": "ID_[0-9]{3}" },
                "Address": { "sdtype": "physical_address", "pii": true },
                "Smoker": { "sdtype": "boolean" },
                "Time": { "sdtype": "datetime", "datetime_format": "%m/%d/%Y" },
                "Heart Rate": { "sdtype": "categorical" },
                "Systolic BP": { "sdtype": "numerical" }
            }
        }
    }
}
```

</details>

**`"sequence_key"`**: The name of the column that acts as the sequence key, if you have multi-sequence data

{% hint style="info" %}
The **sequence key** is a column that identifies which row(s) belong to which sequences. This is usually an ID column but it may also be a PII sdtype (such as `"phone_number"`).

This is important for tables that contain multiple sequences. In our example, the sequence key is `'Patient ID'` because this column is used to break up the sequences.

If you don't supply a sequence key, the SDV assumes that your table only contains a single sequence. *Note: The SDV sequential models do not fully support single sequence data.*
{% endhint %}

**`"sequence_index"`**: The name of the column that acts as the sequence index, if you have sequential data

{% hint style="info" %}
The **sequence index** determines the spacing between the rows in a sequence. Use this if you have an explicit index such as a timestamp. If you don't supply a sequence index, the SDV assumes there is equal spacing of an unknown unit.
{% endhint %}
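
To show what the two keywords mean for the data itself, the sketch below splits a few made-up rows into sequences using the sequence key and orders each sequence by the sequence index. The rows and the helper `split_sequences` are hypothetical and exist only for illustration. This is not how the SDV processes sequences internally.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows, in the shape of the "patients" table above.
rows = [
    {"Patient ID": "ID_001", "Time": "01/03/2020", "Systolic BP": 121},
    {"Patient ID": "ID_002", "Time": "01/01/2020", "Systolic BP": 135},
    {"Patient ID": "ID_001", "Time": "01/01/2020", "Systolic BP": 118},
]


def split_sequences(rows, sequence_key, sequence_index, datetime_format):
    """Group rows by the sequence key, then sort each group by the sequence index."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[sequence_key]].append(row)
    for sequence in sequences.values():
        sequence.sort(
            key=lambda row: datetime.strptime(row[sequence_index], datetime_format)
        )
    return dict(sequences)


sequences = split_sequences(rows, "Patient ID", "Time", "%m/%d/%Y")
for patient_id, sequence in sequences.items():
    print(patient_id, [row["Systolic BP"] for row in sequence])
```

Each distinct value of the sequence key yields one sequence, and the sequence index fixes the row order within it.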

