
# Sdtypes

All SDV models require information about the data type for every column. In the SDV, the data types are specified by **sdtype**, denoting a semantic or statistical meaning.

{% hint style="info" %}
An sdtype is a high level concept that does not depend on how a computer stores the data. A single sdtype (such as `"categorical"`) can be stored by a computer in several ways (string, integer, etc).
{% endhint %}

## Common Sdtypes

There are 5 common sdtypes that describe columns in a dataset.

### Boolean

Sdtype `boolean` describes columns that contain `TRUE` or `FALSE` values and may contain some missing data.

```json
{
    "is_active": {
        "sdtype": "boolean"
    }
}
```

### Categorical

Sdtype `categorical` describes columns that contain distinct categories. The defining aspect of a categorical column is that **only the values that appear in the real data are valid**.

The categories may be ordered or unordered.

{% hint style="info" %}
An example of categorical data is taxpayer status such as `Single`, `Married filing jointly`, `Widowed`, etc. Only these distinct categories are allowed.

If you want the synthetic data to include *new* values that were not in the original data, then the column is not categorical. For example, if you have address data and would like the synthetic data to create new, unseen addresses, see [other sdtypes](#other-sdtypes) below.
{% endhint %}

```json
{
    "gender": {
        "sdtype": "categorical"
    }
}
```
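
One way to read the rule above is that every synthetic value in a categorical column should already appear in the real data. The sketch below checks this with plain Python. The helper name `unseen_categories` and the sample values are illustrative assumptions, not part of the SDV API.

```python
def unseen_categories(real_values, synthetic_values):
    """Return synthetic values that never appear in the real data.

    Missing values (None) are ignored on both sides.
    """
    allowed = {v for v in real_values if v is not None}
    produced = {v for v in synthetic_values if v is not None}
    return produced - allowed


# Hypothetical sample data for illustration only
real = ['Single', 'Married filing jointly', 'Widowed', None]
synthetic = ['Widowed', 'Single', 'Single']

print(unseen_categories(real, synthetic))  # set()
```

An empty result means the synthetic column only reuses categories from the real data, which is what you would expect for a categorical column.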

### Datetime

Sdtype `datetime` describes columns that indicate a point in time. This can be at any granularity: to the nearest day, minute, second or even nanosecond. Typically, the datetime will be represented as a string.

**Properties**

* (required) `datetime_format`: A string describing the format as defined by [Python's strftime format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).

```json
{
    "start_date": { 
        "sdtype": "datetime",
        "datetime_format": "%Y-%m-%d"
    }
}
```
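
Before writing a `datetime_format` into the metadata, it can help to confirm that it actually parses your values. The sketch below uses Python's standard `datetime.strptime` for that check. The helper name `matches_format` and the sample strings are assumptions made for illustration.

```python
from datetime import datetime

# The format string from the metadata example above
datetime_format = '%Y-%m-%d'


def matches_format(value, fmt):
    """Return True if the string can be parsed with the given strftime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


print(matches_format('2023-01-31', datetime_format))  # True
print(matches_format('01/31/2023', datetime_format))  # False
```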

### Numerical

Sdtype `numerical` describes data with numbers. The defining aspect of numerical data is that **there is an order and you can apply a variety of mathematical computations** to the values (average, sum, etc.). The actual values may follow a specific format, such as being rounded to 2 decimal digits and remaining between min/max bounds.

{% hint style="info" %}
Some data may appear numerical but actually represents distinct categories. For example, HTTP response codes such as `200`, `404`, etc. are categorical data. These numbers don't have a specific order, and they cannot be combined or averaged.
{% endhint %}

**Properties**

* `computer_representation`: A string that represents how you'll ultimately store the data. This determines the min and max values allowed. Available options are: `'Float'`, `'Int8'`, `'Int16'`, `'Int32'`, `'Int64'`, `'UInt8'`, `'UInt16'`, `'UInt32'`, `'UInt64'`

```json
{
    "age": { 
        "sdtype": "numerical",
        "computer_representation": "Int64"
    },
    "transaction_amt": {
        "sdtype": "numerical",
        "computer_representation": "Float"
    }
}
```
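
To see how a `computer_representation` implies min and max values, the sketch below derives the bounds for the integer options from their bit width. It assumes the usual two's-complement ranges for signed integers. The helper `integer_bounds` is hypothetical, and this is not necessarily how the SDV enforces the limits internally.

```python
def integer_bounds(computer_representation):
    """Return (min, max) for names such as 'Int8' or 'UInt16'.

    Returns None for 'Float', which has no integer bounds.
    """
    if computer_representation == 'Float':
        return None

    signed = not computer_representation.startswith('UInt')
    # 'Int8' -> ['', '8'] and 'UInt16' -> ['U', '16']
    bits = int(computer_representation.split('Int')[1])

    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


print(integer_bounds('Int8'))    # (-128, 127)
print(integer_bounds('UInt16'))  # (0, 65535)
```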

### ID

Sdtype `id` describes columns that are used to identify rows (e.g. as a primary or foreign key). ID columns do not have any other mathematical or special meanings. Typically, an ID column follows a particular structure, for example being exactly 8 digits long with a `-` in the middle.

**Properties**

* `regex_format`: A string describing the format of the ID as a [regular expression](https://docs.python.org/3/library/re.html)

```json
{
    "product_code": {
        "sdtype": "id",
        "regex_format": "[0-9]{4}-[0-9]{4}"
    }
}
```
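
A quick way to sanity-check a `regex_format` is to test it against a few known IDs with Python's `re` module. The sketch below uses `re.fullmatch` so that the entire value must follow the pattern. The helper name `follows_format` and the sample IDs are illustrative assumptions.

```python
import re

# The pattern from the metadata example above
regex_format = '[0-9]{4}-[0-9]{4}'


def follows_format(value, pattern):
    """Return True if the whole string matches the regular expression."""
    return re.fullmatch(pattern, value) is not None


print(follows_format('1234-5678', regex_format))  # True
print(follows_format('12345678', regex_format))   # False
```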

## Additional Sdtypes: Domain-Specific Concepts & PII

You may find that some of the columns in your dataset represent high-level concepts in your domain. Such data might also contain sensitive, Personally Identifiable Information (PII).

For these types of concepts, the synthetic data can contain *entirely new values* that don't appear in the original data. In some cases, the SDV can also extract deeper meaning from the concepts to understand the context.

Browse below for some common sdtypes related to different concepts.

{% tabs %}
{% tab title="Personal Info" %}
These sdtypes describe the information about a person.

<table data-header-hidden><thead><tr><th width="205"></th><th></th></tr></thead><tbody><tr><td>＊<code>phone_number</code></td><td>A local or international phone number such as <code>'+1(555)123-4567'</code>. Different countries have different formats.</td></tr><tr><td>＊<code>email</code></td><td>A person's email such as <code>'first_last@gmail.com'</code></td></tr><tr><td><code>ssn</code></td><td>A social security number such as <code>000-00-0000</code></td></tr><tr><td><code>first_name</code></td><td>A person's first name</td></tr><tr><td><code>last_name</code></td><td>A person's last name</td></tr></tbody></table>

*＊ Licensed, enterprise users will see higher quality data. The SDV will extract the deeper meaning and replicate it in the synthetic data. To learn more,* [*visit our website*](https://datacebo.com/pricing/)*.*
{% endtab %}

{% tab title="Location" %}
These sdtypes describe a location around the world.

<table data-header-hidden><thead><tr><th width="280"></th><th></th></tr></thead><tbody><tr><td>＊<code>country_code</code></td><td>A 2-character country code such as <code>'US'</code></td></tr><tr><td>＊<code>administrative_unit</code></td><td>The name of a region inside the country such as <code>'Massachusetts'</code>. Countries call this concept different names such as <em>state</em> or <em>province</em>.</td></tr><tr><td>＊<code>state_abbr</code></td><td>For countries that call their regions <em>states</em>, this refers to the 2-character code such as <code>'MA'</code></td></tr><tr><td>＊<code>city</code></td><td>The full name of the city such as <code>'Boston'</code></td></tr><tr><td>＊<code>postcode</code></td><td>The internationally-recognized, 5-digit postcode such as <code>02116</code></td></tr><tr><td>＊<code>street_address</code></td><td>The street and building number such as <code>'123 Main St'</code>. The exact format of this may vary by country</td></tr><tr><td>＊<code>secondary_address</code></td><td>Additional information about units in the building, such as <code>'Apartment #3'</code>. </td></tr><tr><td><code>latitude</code></td><td>The latitude of a location, expressed as a decimal</td></tr><tr><td><code>longitude</code></td><td>The longitude of a location, expressed as a decimal</td></tr></tbody></table>

*＊ Licensed, enterprise users will see higher quality data. The SDV will extract the deeper meaning and replicate it in the synthetic data. To learn more,* [*visit our website*](https://datacebo.com/pricing/)*.*
{% endtab %}

{% tab title="Networking" %}
These sdtypes describe information about computer networks and the internet.

<table data-header-hidden><thead><tr><th width="233"></th><th></th></tr></thead><tbody><tr><td><code>ipv4_address</code></td><td>An IP address, using the v4 protocol</td></tr><tr><td><code>ipv6_address</code></td><td>An IP address, using the v6 protocol</td></tr><tr><td><code>mac_address</code></td><td>A media access control address</td></tr><tr><td><code>user_agent_string</code></td><td>A user agent string sent by the HTTP protocol</td></tr></tbody></table>

{% endtab %}

{% tab title="Banking" %}
These sdtypes describe information needed for banking functions such as payment transfers.

<table data-header-hidden><thead><tr><th width="242"></th><th></th></tr></thead><tbody><tr><td><code>iban</code></td><td>An international bank account number</td></tr><tr><td><code>swift11</code></td><td>A SWIFT bank code that uses 11 characters</td></tr><tr><td><code>swift8</code></td><td>A SWIFT bank code that uses 8 characters</td></tr><tr><td><code>credit_card_number</code></td><td>A credit card number, expressed using digits</td></tr></tbody></table>

{% endtab %}

{% tab title="Automotive" %}
These sdtypes describe concepts from the automotive industry.

<table data-header-hidden><thead><tr><th width="200"></th><th></th></tr></thead><tbody><tr><td><code>vin</code></td><td>A vehicle identification number</td></tr><tr><td><code>license_plate</code></td><td>A license plate number, expressed using digits, letters or other characters. The format varies by country.</td></tr></tbody></table>
{% endtab %}

{% tab title="Other" %}
Many other sdtypes are possible. The SDV models can use the [Python Faker library](https://faker.readthedocs.io/en/master/providers.html) for new data types. You can input any of the function names as sdtypes. For example, inputting the sdtype `passport_number` will use [this function](https://faker.readthedocs.io/en/master/providers/faker.providers.passport.html#faker.providers.passport.Provider.passport_number) to generate meaningful numbers.

For full SDV support, [file a request](https://github.com/sdv-dev/SDV/issues/new/choose) to help us prioritize other data types.
{% endtab %}
{% endtabs %}

**Properties**

* `pii`: A boolean denoting whether the data is sensitive
  * (default) `true`: The column is sensitive, meaning the values should be anonymized. If not provided, we assume that the column is PII.
  * `false`: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data

```json
{
    "user_ssn": {
        "sdtype": "ssn"
    },
    "user_city": {
        "sdtype": "city",
        "pii": false
    }
}
```
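
The default rule for `pii` can be sketched in a few lines of Python: when the property is absent, the column is treated as sensitive. The helper `is_pii` is hypothetical and only mirrors the rule described above. It is not an SDV function.

```python
# The same columns as the metadata example above, written as a Python dict
metadata_columns = {
    'user_ssn': {'sdtype': 'ssn'},
    'user_city': {'sdtype': 'city', 'pii': False},
}


def is_pii(column_metadata):
    """Return the effective pii flag, which defaults to True when not provided."""
    return column_metadata.get('pii', True)


for name, column in metadata_columns.items():
    print(name, is_pii(column))
# user_ssn True
# user_city False
```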

### FAQs

<details>

<summary>How does the SDV factor in different countries or languages?</summary>

Many concepts vary based on the country and language. For example, phone numbers are represented differently in different countries.

The SDV aims to provide worldwide support. You can specify the locales when you create a synthesizer. This lets the SDV know that any higher-level concepts should conform to the right formatting rules for that country.

For example, assume you provide the following info:

```python
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(
    metadata, locales=['en_US', 'nl_BE'])
```

Then for every concept described in your metadata, the SDV will generate values only from the US or Belgium in the appropriate language (English or Dutch).

</details>

<details>

<summary>Does the SDV understand the context of PII data?</summary>

The public SDV randomly creates values corresponding to the concept, without taking additional context into account. Sometimes this may not be enough. For example, you may want to extract geographical areas from `phone_number` to ensure that it follows the same patterns.

These features are available to licensed users. To learn more, [contact us](https://datacebo.com/contact/).

</details>

