Sdtypes

Let SDV know what type of data you have in each column of your table. In the SDV, the data types are specified by sdtype, denoting a semantic or statistical meaning. SDV is designed to create synthetic data differently based on each sdtype.

An sdtype is a high-level concept that tells SDV what your data means. An sdtype does not depend on how a computer stores the data. A single sdtype (such as "categorical") can be stored by a computer in several ways (string, integer, etc.).
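As a rough illustration of the difference between an sdtype and a storage type, the sketch below describes two columns that Python stores differently with the same sdtype. The column names and values are hypothetical, and the dictionary only mirrors the JSON layout used on this page.

```python
# Two hypothetical columns with different storage types.
membership_tier = ["gold", "silver", "gold", "bronze"]  # stored as strings
http_status = [200, 404, 200, 500]                       # stored as integers

# Both hold a fixed set of distinct values, so both can be described
# with the same sdtype even though they are stored differently.
metadata_columns = {
    "membership_tier": {"sdtype": "categorical"},
    "http_status": {"sdtype": "categorical"},
}

# Record how Python stores the first value of each column.
storage_types = {
    "membership_tier": type(membership_tier[0]).__name__,
    "http_status": type(http_status[0]).__name__,
}
print(storage_types)
```

The storage types differ, but the sdtype recorded for both columns is the same.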

Common Sdtypes

Below are 5 common sdtypes that describe columns in a dataset.

Boolean

A boolean column contains TRUE or FALSE values and may contain some missing data.

{
    "is_active": {
        "sdtype": "boolean"
    }
}

Synthetic data goal: Your synthetic data will have the same proportion of True/False/missing values as the real data.
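To make the goal concrete, here is a minimal sketch of how you might measure the True/False/missing proportions of a column yourself. It uses only the standard library; the helper name and the sample values are hypothetical, and None stands in for missing data.

```python
from collections import Counter

def value_proportions(values):
    """Return the share of each distinct value (including None) in a column."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical real column: None marks a missing value.
real_is_active = [True, True, False, None, True, False, True, None]

proportions = value_proportions(real_is_active)
print(proportions)
```

Running the same helper on a synthetic column and comparing the two results gives a quick check of how closely the proportions match.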

Categorical

Sdtype categorical describes columns that contain distinct categories. The defining aspect of a categorical column is that only the values that appear in the real data are valid.

The categories may be ordered or unordered.

An example of categorical data is taxpayer status such as Single, Married filing jointly, Widowed, etc. Only these distinct categories are allowed.

{
    "gender": {
        "sdtype": "categorical"
    }
}

Synthetic data goal: Your synthetic data will contain the exact same category values as the real data, in similar proportions.

If you want the synthetic data to include new values that were not in the original data, then the column is not categorical. See other sdtypes below.
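One way to check the "only values from the real data" property is to look for synthetic values that never appear in the real column. The sketch below is a plain Python illustration with hypothetical data, not part of the SDV API.

```python
def unseen_categories(real_values, synthetic_values):
    """Return synthetic values that never appear in the real column."""
    return set(synthetic_values) - set(real_values)

# Hypothetical real column and two candidate synthetic columns.
real_status = ["Single", "Married filing jointly", "Widowed", "Single"]
good_synthetic = ["Widowed", "Single", "Single", "Married filing jointly"]
bad_synthetic = ["Single", "Divorced"]

print(unseen_categories(real_status, good_synthetic))
print(unseen_categories(real_status, bad_synthetic))
```

An empty result means the synthetic column is consistent with a categorical sdtype. A non-empty result lists the values that would not be valid categories.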

Datetime

Sdtype datetime describes columns that indicate a point in time. This can be at any granularity: to the nearest day, minute, second or even nanosecond. Typically, the datetime will be represented as a string.

Properties

(required) datetime_format: A string describing the format, using Python's strftime format codes.

{
    "start_date": { 
        "sdtype": "datetime",
        "datetime_format": "%Y-%m-%d"
    }
}

Synthetic data goal: Your synthetic data will contain datetime values that are within the same overall range and distribution shape as the real data. The values will also conform to the datetime format.
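If you are unsure whether a datetime_format string matches your data, you can test it with Python's standard datetime module before writing the metadata. The helper below is a hypothetical sketch; the round-trip comparison is a design choice that rejects loosely formatted strings which strptime would otherwise accept.

```python
from datetime import datetime

DATETIME_FORMAT = "%Y-%m-%d"  # same format string as the metadata example

def matches_format(value, datetime_format=DATETIME_FORMAT):
    """Return True if the string parses and round-trips with the format."""
    try:
        parsed = datetime.strptime(value, datetime_format)
    except ValueError:
        return False
    # Round-trip so that unpadded strings such as "2021-1-5" are rejected.
    return parsed.strftime(datetime_format) == value

# Hypothetical real values, used to find the overall range.
real_dates = ["2020-03-01", "2021-07-15", "2022-12-31"]
parsed_dates = [datetime.strptime(d, DATETIME_FORMAT) for d in real_dates]
earliest, latest = min(parsed_dates), max(parsed_dates)

print(matches_format("2021-07-15"), matches_format("07/15/2021"))
print(earliest.date(), latest.date())
```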

Numerical

Sdtype numerical describes data with numbers. The defining aspect of numerical data is that there is an order and you can apply a variety of mathematical computations to the values (average, sum, etc.). The actual values may follow a specific format, such as being rounded to 2 decimal digits and remaining between min/max bounds.

Some data may appear numerical but actually represents distinct categories. For example, HTTP response codes such as 200, 404, etc. are categorical data. These numbers don't have a specific order, and they cannot be combined or averaged.

Properties

computer_representation: A string that represents how you'll ultimately store the data. This determines the min and max values allowed. Available options are: 'Float', 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'

{
    "age": { 
        "sdtype": "numerical",
        "computer_representation": "Int64"
    },
    "transaction_amt": {
        "sdtype": "numerical",
        "computer_representation": "Float"
    }
}

Synthetic data goal: Your synthetic data will contain numerical values that are within the same overall range and distribution shape as the real data. They will also be rounded to the same precision as your real data.
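The sketch below shows the min and max values implied by the integer options, assuming they follow the usual fixed-width conventions (two's complement for 'IntN', unsigned for 'UIntN'). The helper names are hypothetical, and 'Float' is left out because it is not bounded in the same way.

```python
def integer_bounds(computer_representation):
    """Return (min, max) for an 'IntN' or 'UIntN' representation name."""
    if computer_representation.startswith("UInt"):
        bits = int(computer_representation[len("UInt"):])
        return 0, 2 ** bits - 1
    if computer_representation.startswith("Int"):
        bits = int(computer_representation[len("Int"):])
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"Not an integer representation: {computer_representation}")

def fits(value, computer_representation):
    """Check whether a whole number fits inside the representation's bounds."""
    low, high = integer_bounds(computer_representation)
    return low <= value <= high

for name in ["Int8", "Int16", "Int32", "Int64", "UInt8", "UInt16", "UInt32", "UInt64"]:
    print(name, integer_bounds(name))
```

For example, a value of 200 does not fit in 'Int8' but does fit in 'UInt8', so the choice of representation affects which values are allowed.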

ID

Sdtype id describes columns that are used to identify rows (e.g. as a primary or foreign key). ID columns do not have any other mathematical or special meaning. Typically, an ID column follows a particular structure, for example being exactly 8 digits long with a - in the middle.

Properties

regex_format: A string describing the format of the ID as a regular expression

{
    "product_code": {
        "sdtype": "id",
        "regex_format": "[0-9]{4}-[0-9]{4}"
    }
}

Synthetic data goal: Your synthetic data will contain brand new, randomly generated IDs based on the regex. If you have multiple tables, the primary and foreign key IDs will match up.
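Before adding a regex_format to your metadata, it can help to confirm that the expression matches your real IDs in full. This is a minimal sketch using Python's re module with hypothetical sample values; fullmatch is used so that extra leading or trailing characters are rejected.

```python
import re

# Same expression as the metadata example above.
ID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}")

def is_valid_id(value):
    """Return True only if the whole string matches the ID pattern."""
    return ID_PATTERN.fullmatch(value) is not None

# Hypothetical IDs to check against the pattern.
candidates = ["1234-5678", "12345678", "1234-56789"]
results = {value: is_valid_id(value) for value in candidates}
print(results)
```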

Additional Sdtypes: Domain-Specific Concepts & PII

You may find that some of the columns in your dataset represent high-level concepts in your domain. Such data might also contain sensitive, Personally Identifiable Information (PII).

Properties

pii: A boolean denoting whether the data is sensitive
- (default) true: The column is sensitive, meaning the values should be anonymized. If not provided, we assume that the column is PII.
- false: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data

{
    "user_ssn": {
        "sdtype": "ssn"
    },
    "user_city": {
        "sdtype": "city",
        "pii": false
    }
}

Synthetic data goals: Your synthetic data will contain entirely new values that do not necessarily appear in the original data. If you are using SDV Enterprise, the synthetic data may conform to some generic properties that are non-identifiable (e.g. an area code of a phone number).
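The default for pii described above can be expressed as a small lookup: a column with no pii entry is treated as sensitive. The helper below is a hypothetical illustration over a plain dictionary that mirrors the JSON example, not an SDV function.

```python
def is_pii(column_metadata):
    """Apply the documented default: a missing "pii" key means sensitive."""
    return column_metadata.get("pii", True)

# Python equivalent of the JSON example above (JSON false becomes False).
columns = {
    "user_ssn": {"sdtype": "ssn"},
    "user_city": {"sdtype": "city", "pii": False},
}

sensitive_columns = [name for name, meta in columns.items() if is_pii(meta)]
print(sensitive_columns)
```

Here only user_ssn is treated as sensitive, because user_city explicitly sets pii to false.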

Browse below for some common sdtypes related to different concepts.

These sdtypes describe the information about a person.

＊phone_number

A local or international phone number such as '+1(555)123-4567'. Different countries have different formats.

＊email

A person's email such as '[email protected]'

ssn

A social security number such as 000-00-0000

first_name

A person's first name

last_name

A person's last name

＊ Licensed, enterprise users will see higher quality data. The SDV will extract the deeper meaning and replicate it in the synthetic data. To learn more, visit our website.

These sdtypes describe a location around the world.

＊country_code

A 2-character country code such as 'US'

＊administrative_unit

The name of a region inside the country such as 'Massachusetts'. Countries call this concept different names such as state or province.

＊state_abbr

For countries that call their regions states, this refers to the 2-character code such as 'MA'

＊city

The full name of the city such as 'Boston'

＊postcode

The internationally-recognized, 5-digit postcode such as 02116

＊street_address

The street and building number such as '123 Main St'. The exact format of this may vary by country

＊secondary_address

Additional information about units in the building, such as 'Apartment #3'.

latitude

The latitude of a location, expressed as a decimal

longitude

The longitude of a location, expressed as a decimal

＊ Licensed, enterprise users will see higher quality data. The SDV will extract the deeper meaning and replicate it in the synthetic data. To learn more, visit our website.

These sdtypes describe information about computer networks and the internet.

ipv4_address

An IP address, using the v4 protocol

ipv6_address

An IP address, using the v6 protocol

mac_address

A media access control address

user_agent_string

A user agent string sent by the HTTP protocol

These sdtypes describe information needed for banking functions such as payment transfers.

iban

An international bank account number

swift11

A SWIFT bank code that uses 11 characters

swift8

A SWIFT bank code that uses 8 characters

credit_card_number

A credit card number, expressed using digits

These sdtypes describe concepts from the automotive industry.

vin

A vehicle identification number

license_plate

A license plate number, expressed using digits, letters or other characters. The format varies by country.

Many other sdtypes are possible. The SDV models can use the Python Faker library for new data types. You can input any of the Faker function names as sdtypes. For example, inputting the sdtype passport_number will use the Faker function of that name to generate meaningful numbers.

For full SDV support, file a request to help us prioritize other data types.

FAQs

How does the SDV factor in different countries or languages?

Many concepts vary based on the country and language. For example, phone numbers are represented differently in different countries.

The SDV aims to provide worldwide support. You can specify the locales when you create a synthesizer. This lets the SDV know that any higher-level concepts should conform to the right formatting rules for that country.

For example, assume you provide the following info:

from sdv.single_table import GaussianCopulaSynthesizer

# `metadata` is assumed to be the metadata object you already created for your table
synthesizer = GaussianCopulaSynthesizer(
    metadata, locales=['en_US', 'nl_BE'])

Then for every concept described in your metadata, the SDV will generate values only from the US or Belgium in the appropriate language (English or Dutch).
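The locale strings in this example follow a language_COUNTRY pattern. The sketch below splits them into their two parts to show which languages and countries the list implies. It is a plain-Python illustration with a hypothetical helper, and it says nothing about how the SDV processes locales internally.

```python
def split_locale(locale):
    """Split a 'language_COUNTRY' string such as 'en_US' into its parts."""
    language, country = locale.split("_")
    return {"language": language, "country": country}

locales = ["en_US", "nl_BE"]
parsed_locales = [split_locale(locale) for locale in locales]

# Countries and languages implied by the list of locales.
allowed_countries = {item["country"] for item in parsed_locales}
allowed_languages = {item["language"] for item in parsed_locales}
print(allowed_countries, allowed_languages)
```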

Does the SDV understand the context of PII data?

The public SDV randomly creates values corresponding to the concept, without taking additional context into account. Sometimes this may not be enough. For example, you may want to extract geographical areas from phone_number to ensure that it follows the same patterns.

These features are available to licensed users. To learn more, contact us.

