All SDV models require information about the data type of every column. In the SDV, data types are specified by an sdtype, which denotes a semantic or statistical meaning.
An sdtype is a high-level concept that does not depend on how a computer stores the data. A single sdtype (such as "categorical") can be stored by a computer in several ways (string, integer, etc.).
There are 5 common sdtypes that describe columns in a dataset.
boolean describes columns that contain TRUE or FALSE values and may contain some missing data.
categorical describes columns that contain distinct categories. The defining aspect of a categorical column is that only the values that appear in the real data are valid.
The categories may be ordered or unordered.
An example of categorical data is taxpayer status, such as Married filing jointly, Widowed, etc. Only these distinct categories are allowed.
If you want the synthetic data to include new values that were not in the original data, then the column is not categorical. For example, if you have address data and would like the synthetic data to create new, unseen addresses, see other sdtypes below.
datetime describes columns that indicate a point in time. This can be at any granularity: to the nearest day, minute, second or even nanosecond. Typically, the datetime will be represented as a string.
numerical describes data with numbers. The defining aspect of numerical data is that there is an order and you can apply a variety of mathematical computations to the values (average, sum, etc.). The actual values may follow a specific format, such as being rounded to 2 decimal digits and remaining between min/max bounds.
Some data may appear numerical but actually represent distinct categories. For example, HTTP response codes such as 404, etc. are categorical data. These numbers don't have a specific order, and they cannot be combined or averaged.
id describes columns that are used to identify rows (e.g. as a primary or foreign key). ID columns do not have any other mathematical or special meaning. Typically, an ID column follows a particular structure, for example being exactly 8 digits long with a - in the middle.
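As a rough illustration of the five sdtypes above, the sketch below tags the columns of a small, made-up request log and checks a few of the defining properties from the list. It uses only the Python standard library. The column names, the sample values, and the plain dictionary layout are assumptions for illustration and are not the SDV's metadata API.

```python
import re
from datetime import datetime
from statistics import mean

# Hypothetical table: every column name and value here is made up.
rows = [
    {"request_id": "1234-5678", "received": "2023-01-05 09:15:00",
     "cached": True, "response_ms": 12.5, "status_code": 200},
    {"request_id": "2345-6789", "received": "2023-01-05 09:15:02",
     "cached": False, "response_ms": 30.0, "status_code": 404},
    {"request_id": "3456-7890", "received": "2023-01-05 09:16:40",
     "cached": None, "response_ms": 47.5, "status_code": 404},
]

# One sdtype per column, independent of how the values are stored.
column_sdtypes = {
    "request_id": "id",
    "received": "datetime",
    "cached": "boolean",
    "response_ms": "numerical",
    "status_code": "categorical",  # stored as int, but the values are labels
}

def column(name):
    """Return all values of one column as a list."""
    return [row[name] for row in rows]

# id: follows a fixed structure, here 4 digits, a "-", then 4 digits.
id_pattern = re.compile(r"\d{4}-\d{4}")
ids_are_well_formed = all(id_pattern.fullmatch(v) for v in column("request_id"))

# datetime: stored as strings, but each value parses to a point in time.
parsed_times = [datetime.strptime(v, "%Y-%m-%d %H:%M:%S") for v in column("received")]

# boolean: True or False, with None standing in for missing data.
boolean_values_ok = all(v is None or isinstance(v, bool) for v in column("cached"))

# numerical: ordered values for which arithmetic is meaningful.
average_ms = mean(column("response_ms"))

# categorical: only the values seen in the data are valid categories.
allowed_status_codes = set(column("status_code"))
```

Note that `status_code` holds integers, yet averaging it would produce a number that is not a valid response code, which is why it is tagged categorical rather than numerical.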
You may find that some of the columns in your dataset represent high-level concepts in your domain. Such data might also contain sensitive, Personally Identifiable Information (PII).
For these types of concepts, the synthetic data can contain entirely new values that don't appear in the original data. In some cases, the SDV can also extract deeper meaning from the concepts to understand the context.
Browse below for some common sdtypes related to different concepts.
These sdtypes describe the information about a person.
These sdtypes describe a location around the world.
These sdtypes describe information about computer networks and the internet.
These sdtypes describe information needed for banking functions such as payment transfers.
These sdtypes describe concepts from the automotive industry.
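To make the distinction between the five common sdtypes and concept sdtypes more concrete, here is a minimal sketch using a plain Python dictionary. Only `phone_number` is named in this section; the `email` sdtype, the `pii` flag, the column names, and the helper `split_columns` are assumptions for illustration and may not match the SDV's actual metadata format.

```python
# The five general-purpose sdtypes named earlier in this section.
COMMON_SDTYPES = {"boolean", "categorical", "datetime", "numerical", "id"}

# Hypothetical column description (names, "email", and "pii" are assumptions).
columns = {
    "customer_id": {"sdtype": "id"},
    "signup_date": {"sdtype": "datetime"},
    "plan": {"sdtype": "categorical"},
    "contact_phone": {"sdtype": "phone_number", "pii": True},
    "contact_email": {"sdtype": "email", "pii": True},
}

def split_columns(columns):
    """Separate columns with common sdtypes from those with concept sdtypes."""
    common, concepts = [], []
    for name, spec in columns.items():
        target = common if spec["sdtype"] in COMMON_SDTYPES else concepts
        target.append(name)
    return common, concepts

common_columns, concept_columns = split_columns(columns)

# Columns flagged as sensitive, for which entirely new values would be wanted.
sensitive_columns = [name for name, spec in columns.items() if spec.get("pii", False)]
```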
Many concepts vary based on the country and language. For example, phone numbers are represented differently in different countries.
The SDV aims to provide worldwide support. You can specify the locales when you create a synthesizer. This lets the SDV know that any higher-level concepts should conform to the right formatting rules for that country.
For example, assume you provide the following info:
from sdv.single_table import GaussianCopulaSynthesizer

# `metadata` is assumed to be defined already and to describe your table.
synthesizer = GaussianCopulaSynthesizer(
    metadata, locales=['en_US', 'nl_BE'])
Then for every concept described in your metadata, the SDV will generate values only from the US or Belgium in the appropriate language (English or Dutch).
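Each locale string in the example pairs a language code with a country code, joined by an underscore. The short sketch below makes the two parts explicit; the helper `split_locale` is hypothetical and not part of the SDV.

```python
def split_locale(locale):
    """Split a locale such as 'en_US' into (language, country)."""
    language, country = locale.split("_", 1)
    return language, country

locales = ["en_US", "nl_BE"]

# Each entry becomes a (language, country) pair.
parts = [split_locale(loc) for loc in locales]
```

Here `parts` holds `("en", "US")` for English in the United States and `("nl", "BE")` for Dutch in Belgium.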
The public SDV randomly creates values corresponding to the concept, without taking additional context into account. Sometimes this may not be enough. For example, you may want to extract geographical areas from phone_number to ensure that it follows the same patterns.
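As one way to picture what "extracting geographical areas" could mean, the sketch below pulls the three-digit area code out of US-style phone numbers and counts how often each one appears. The helper `area_code`, the accepted number format, and the sample values are assumptions for illustration; this is not how the SDV processes phone numbers.

```python
import re
from collections import Counter

# Assumed format: optional "+1", then a 3-digit area code and 7 more digits,
# with optional spaces, dots, dashes, or parentheses as separators.
US_PHONE = re.compile(r"^(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

def area_code(phone_number):
    """Return the 3-digit area code, or None if the format is not recognized."""
    match = US_PHONE.match(phone_number.strip())
    return match.group(1) if match else None

# Made-up numbers using the fictional 555-01xx range.
real_numbers = ["+1 (415) 555-0132", "415-555-0199", "(617) 555-0143"]

# How often each area code occurs in the (made-up) real data.
area_code_counts = Counter(area_code(n) for n in real_numbers)
```

A context-aware generator could, under these assumptions, aim to keep the share of each area code in the synthetic data close to the counts observed in the real data.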