Sdtypes

Let SDV know what type of data you have in each column of your table. In the SDV, each column's data type is specified by an sdtype, which denotes its semantic or statistical meaning. SDV is designed to create synthetic data differently based on each sdtype.

An sdtype is a high level concept that tells SDV what your data means. An sdtype does not depend on how a computer stores the data. A single sdtype (such as "categorical") can be stored by a computer in several ways (string, integer, etc).

Common Sdtypes

Below are 5 common sdtypes that describe columns in a dataset.

Boolean

A boolean column contains TRUE or FALSE values and may contain some missing data.

{
    "is_active": {
        "sdtype": "boolean"
    }
}

Categorical

Sdtype categorical describes columns that contain distinct categories. The defining aspect of a categorical column is that only the values that appear in the real data are valid.

The categories may be ordered or unordered.

An example of categorical data is tax payer status such as Single, Married filing jointly, Widowed, etc. Only these distinct categories are allowed.
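The rule that only observed values are valid can be sketched in plain Python. This is an illustration and not SDV code: the column name `tax_status`, the sample values and the helper `is_valid_category` are hypothetical, and the metadata entry simply mirrors the boolean example above.

```python
# Hypothetical metadata entry, mirroring the shape of the boolean example.
metadata_columns = {
    "tax_status": {"sdtype": "categorical"},
}

# Hypothetical real data for the column.
real_values = ["Single", "Married filing jointly", "Widowed", "Single"]

# The set of valid categories is exactly the set of values seen in the real data.
allowed_categories = set(real_values)


def is_valid_category(value):
    """Return True only if the value appeared in the real data."""
    return value in allowed_categories


print(is_valid_category("Widowed"))   # a category seen in the real data
print(is_valid_category("Divorced"))  # a category never seen in the real data
```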

Datetime

Sdtype datetime describes columns that indicate a point in time. This can be at any granularity: to the nearest day, minute, second or even nanosecond. Typically, the datetime will be represented as a string.
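Because a datetime is typically stored as a string, a format description is needed to interpret it as a point in time. The sketch below uses Python's standard `strptime` to show the idea. The sample values and the format string are hypothetical.

```python
from datetime import datetime

# Hypothetical column values stored as strings, to the nearest second.
raw_values = ["2023-02-01 17:45:10", "2023-01-15 08:30:00"]

# Assumed format string, in Python's strftime/strptime notation.
datetime_format = "%Y-%m-%d %H:%M:%S"

# Parse each string into a datetime object.
parsed = [datetime.strptime(value, datetime_format) for value in raw_values]

# Once parsed, the values are ordered points in time rather than plain text.
earliest = min(parsed)
print(earliest.isoformat())
```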

Properties

Numerical

Sdtype numerical describes data with numbers. The defining aspect of numerical data is that there is an order and you can apply a variety of mathematical computations to the values (average, sum, etc.). The actual values may follow a specific format, such as being rounded to 2 decimal digits and remaining between min/max bounds.

Some data may appear numerical but actually represents distinct categories. For example, HTTP response codes such as 200, 404, etc. are categorical data. These numbers don't have a specific order, and they cannot be combined or averaged.
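A short sketch of why averaging such codes is not meaningful, using a hypothetical list of response codes. Counting occurrences, as one would for categories, gives a sensible summary, while the arithmetic mean is not a status code at all.

```python
from collections import Counter
from statistics import mean

# Hypothetical HTTP response codes observed in a log.
response_codes = [200, 404, 200, 500, 200]

# Treating the codes as numbers yields a value that is not a real status code.
misleading_average = mean(response_codes)

# Treating them as categories: count how often each distinct code occurs.
code_counts = Counter(response_codes)
most_common_code, occurrences = code_counts.most_common(1)[0]

print(misleading_average)
print(most_common_code, occurrences)
```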

Properties

  • computer_representation: A string that represents how you'll ultimately store the data. This determines the min and max values allowed. Available options are: 'Float', 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'
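To see how a representation determines the allowed range, the sketch below computes the standard bounds for the signed and unsigned integer options. The helper `integer_bounds` is hypothetical and is not part of SDV. 'Float' is left out because its range depends on a precision that is not stated here.

```python
def integer_bounds(representation):
    """Return (min, max) for names such as 'Int8' or 'UInt16'."""
    if representation.startswith("UInt"):
        bits = int(representation[len("UInt"):])
        # Unsigned integers range from 0 to 2^bits - 1.
        return 0, 2**bits - 1
    if representation.startswith("Int"):
        bits = int(representation[len("Int"):])
        # Signed (two's complement) integers range from -2^(bits-1) to 2^(bits-1) - 1.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"Not an integer representation: {representation!r}")


for name in ["Int8", "Int16", "UInt8", "UInt16"]:
    print(name, integer_bounds(name))
```

For instance, a column declared as 'Int8' can only hold values from -128 to 127, so a column of ages fits but a column of yearly salaries does not.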

ID

Sdtype id describes columns that are used to identify rows (e.g. as a primary or foreign key). ID columns may not have a statistical or mathematical meaning behind them. Typically, an ID column follows a particular structure, for example being exactly 8 digits long with a - in the middle.
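A structure like this can be written as a regular expression. The sketch below uses Python's `re` module and assumes the dash sits between two groups of four digits. Both the pattern and the helper `matches_id_format` are illustrative.

```python
import re

# Assumed pattern: four digits, a dash, then four digits (8 digits in total).
id_pattern = re.compile(r"[0-9]{4}-[0-9]{4}")


def matches_id_format(value):
    """Return True if the whole string follows the assumed ID structure."""
    return id_pattern.fullmatch(value) is not None


print(matches_id_format("1234-5678"))
print(matches_id_format("12345678"))
```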

Properties

* ID Regex Formats in SDV Enterprise

SDV Enterprise is designed to automatically detect Regexes based on your data. You can view or update these Regexes. SDV Enterprise can also detect and accept Regexes that contain embedded context.

What is embedded context? Many ID columns have completely random values that do not contain any statistical meaning. But other times, a portion of your ID column may have a meaning — it is not completely random. We call this portion of the ID column embedded context.

For example, consider an ID that starts with a 2-letter country code followed by a 5-character random value, such as US-3RO9P or CA-99QC4. In this case, the first 2 letters form the context because there is a meaning behind them. SDV should not invent random countries. However, the final 5 characters are random without any statistical meaning.

SDV Enterprise detects embedded context using named capture groups within the Regex. Denoted by parentheses and a group name, named capture groups let the model know to learn the context exactly as-is and not invent new random values for it. In the example above, the Regex would wrap the 2-letter country code in a named capture group called country.

The name of the group (in this case country) is important.

  • If it represents an entirely new concept that is not found anywhere else in the data, then it should have a brand new name that is not the same as any other column.

  • If the concept already exists as another column in the data, then it should be exactly the same as the column name.
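A minimal sketch of a named capture group for the country-code example, written in Python's `(?P<name>...)` syntax. The exact pattern is an assumption built from the two sample IDs (two uppercase letters, a dash, then five uppercase letters or digits), and the syntax SDV Enterprise expects may differ.

```python
import re

# Assumed pattern: the country code is the named group, the rest is random.
id_pattern = re.compile(r"(?P<country>[A-Z]{2})-[A-Z0-9]{5}")

for sample_id in ["US-3RO9P", "CA-99QC4"]:
    match = id_pattern.fullmatch(sample_id)
    # The named group isolates the part of the ID that carries meaning.
    print(sample_id, "->", match.group("country"))
```

Only the portion inside the named group carries meaning. The remaining five characters match any combination of uppercase letters and digits.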

SDV Enterprise is designed to auto-detect Regexes, including Regexes with context. If you are experiencing issues with this feature, please reach out.

Additional Sdtypes: Domain-Specific Concepts & PII

You may find that some of the columns in your dataset represent high-level concepts in your domain. Such data might also contain sensitive, Personally Identifiable Information (PII).

Properties

  • pii: A boolean denoting whether the data is sensitive

    • (default) true: The column is sensitive, meaning the values should be anonymized. If not provided, we assume that the column is PII.

    • false: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data
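The default described above can be sketched as a small lookup. The column names and the helper `is_pii` are hypothetical, and placing `pii` next to `sdtype` in each entry is an assumption modeled on the boolean example earlier in this page.

```python
# Hypothetical metadata entries for three columns.
columns = {
    "mobile": {"sdtype": "phone_number"},                      # pii not provided
    "contact_email": {"sdtype": "email", "pii": True},         # explicitly sensitive
    "office_line": {"sdtype": "phone_number", "pii": False},   # values may be reused
}


def is_pii(column_metadata):
    """Return the pii flag, treating a missing flag as True."""
    return column_metadata.get("pii", True)


for name, column_metadata in columns.items():
    print(name, is_pii(column_metadata))
```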

Browse below for some common sdtypes related to different concepts.

These sdtypes describe information about a person.

phone_number

A local or international phone number such as '+1(555)123-4567'. Different countries have different formats.

email

A person's email address

ssn

A social security number such as 000-00-0000

first_name

A person's first name

last_name

A person's last name

* Licensed enterprise users will see higher-quality data. The SDV will extract the deeper meaning and replicate it in the synthetic data. To learn more, visit our website.

FAQs

How does the SDV factor in different countries or languages?

Many concepts vary based on the country and language. For example, phone numbers are represented differently in different countries.

The SDV aims to provide worldwide support. You can specify the locales when you create a synthesizer. This lets the SDV know that any higher-level concepts should conform to the right formatting rules for that country.

For example, assume you specify locales for the United States (English) and Belgium (Dutch).

Then for every concept described in your metadata, the SDV will generate values only from the US or Belgium in the appropriate language (English or Dutch).
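As a sketch, such a choice could be expressed as a list of locale codes. The codes `en_US` and `nl_BE` (in language_COUNTRY form) are assumptions chosen to match the US/English and Belgium/Dutch pairing above. Passing the list through a `locales` argument when creating a synthesizer is also an assumption, so it appears only as a comment. The helper `split_locale` is hypothetical.

```python
# Assumed locale codes: English (United States) and Dutch (Belgium).
locales = ["en_US", "nl_BE"]


def split_locale(locale):
    """Split a 'language_COUNTRY' code into its two parts."""
    language, country = locale.split("_")
    return language, country


for locale in locales:
    language, country = split_locale(locale)
    print(f"{locale}: language={language}, country={country}")

# Assumed usage, not executed here:
# synthesizer = SomeSynthesizer(metadata, locales=locales)
```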

Does the SDV understand the context of PII data?

The public SDV randomly creates values corresponding to the concept, without taking additional context into account. Sometimes this may not be enough. For example, you may want to extract geographical areas from phone_number so that the synthetic phone numbers follow the same geographic patterns as the real data.

These features are available to licensed users. To learn more, contact us.
