Sdtypes
All SDV models require information about the data type for every column. In the SDV, the data types are specified by sdtype, denoting a semantic or statistical meaning.
There are 5 common sdtypes that describe columns in a dataset.
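As a rough illustration, the five common sdtypes might be assigned to columns as in the plain Python dictionary below. The column names and the `unknown_sdtypes` helper are hypothetical, and the layout (one `sdtype` key per column) is an assumption for the sketch, not a definitive metadata schema.

```python
# Hypothetical table: every column is assigned exactly one sdtype.
columns = {
    "is_active": {"sdtype": "boolean"},
    "plan": {"sdtype": "categorical"},
    "signup_date": {"sdtype": "datetime"},
    "monthly_spend": {"sdtype": "numerical"},
    "user_id": {"sdtype": "id"},
}

# The five common sdtypes named in this section.
COMMON_SDTYPES = {"boolean", "categorical", "datetime", "numerical", "id"}


def unknown_sdtypes(columns):
    """Return the names of columns whose sdtype is not one of the five common ones."""
    return [
        name
        for name, spec in columns.items()
        if spec.get("sdtype") not in COMMON_SDTYPES
    ]


print(unknown_sdtypes(columns))
```

A check like this only flags columns that use one of the concept-specific sdtypes (or a typo); it does not validate the data itself.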
Sdtype boolean
describes columns that contain TRUE or FALSE values and may contain some missing data.
Sdtype categorical
describes columns that contain distinct categories. The defining aspect of a categorical column is that only the values that appear in the real data are valid.
The categories may be ordered or unordered.
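The defining rule for categorical columns (only values seen in the real data are valid) can be sketched as a simple membership check. The `invalid_categories` helper and the sample values are hypothetical and are not part of the SDV.

```python
def invalid_categories(real_values, candidate_values):
    """Return the candidate values that never appear in the real data."""
    allowed = set(real_values)
    return [value for value in candidate_values if value not in allowed]


# Hypothetical real column and two candidate synthetic values.
real = ["basic", "premium", "basic", "enterprise"]
print(invalid_categories(real, ["premium", "gold"]))
```

Here "gold" would be flagged because it does not occur in the real column, while "premium" is accepted.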
Sdtype datetime
describes columns that indicate a point of time. This can be at any granularity: to the nearest day, minute, second or even nanosecond. Typically, the datetime will be represented as a string.
Properties
Sdtype numerical
describes data with numbers. The defining aspect of numerical data is that there is an order and you can apply a variety of mathematical computations to the values (average, sum, etc.). The actual values may follow a specific format, such as being rounded to 2 decimal digits and remaining between min/max bounds.
Properties
computer_representation: A string that represents how you'll ultimately store the data. This determines the min and max values allowed. Available options are: 'Float', 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'
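To make the link between the representation and the min/max values concrete, here is a small sketch that computes the standard fixed-width integer ranges. The `integer_bounds` helper is hypothetical, and it assumes the integer options use the usual two's-complement limits; 'Float' is deliberately not handled.

```python
def integer_bounds(representation):
    """Return (min, max) for names such as 'Int8' or 'UInt16'.

    Assumes standard fixed-width integer ranges; raises ValueError otherwise.
    """
    if representation.startswith("UInt"):
        bits = int(representation[len("UInt"):])
        return 0, 2 ** bits - 1
    if representation.startswith("Int"):
        bits = int(representation[len("Int"):])
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"Not an integer representation: {representation!r}")


for name in ("Int8", "UInt8", "Int16", "UInt16"):
    print(name, integer_bounds(name))
```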
Sdtype id
describes columns that are used to identify rows (e.g. as a primary or foreign key). ID columns do not have any other mathematical or special meaning. Typically, an ID column follows a particular structure, for example being exactly 8 digits long with a - in the middle.
Properties
You may find that some of the columns in your dataset represent high-level concepts in your domain. Such data might also contain sensitive, Personally Identifiable Information (PII).
For these types of concepts, the synthetic data can contain entirely new values that don't appear in the original data. In some cases, the SDV can also extract deeper meaning from the concepts to understand the context.
Browse below for some common sdtypes related to different concepts.
These sdtypes describe the information about a person.
*phone_number
A local or international phone number such as '+1(555)123-4567'. Different countries have different formats.
*email
A person's email such as 'first_last@gmail.com'
ssn
A social security number such as '000-00-0000'
first_name
A person's first name
last_name
A person's last name
Properties
pii: A boolean denoting whether the data is sensitive
(default) true: The column is sensitive, meaning the values should be anonymized. If not provided, we assume that the column is PII.
false: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data
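The default behavior of the pii property can be sketched with a dictionary lookup that falls back to true when the key is absent. The `is_sensitive` helper and the column specifications below are hypothetical illustrations, not SDV functions.

```python
def is_sensitive(column_spec):
    """Hypothetical helper: a column counts as PII unless 'pii' is explicitly False."""
    return column_spec.get("pii", True)


# 'pii' omitted, so the column is assumed to be sensitive.
email_column = {"sdtype": "email"}

# 'pii' explicitly set to False, so the original values may be reused.
name_column = {"sdtype": "first_name", "pii": False}

print(is_sensitive(email_column), is_sensitive(name_column))
```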
(required) datetime_format (sdtype datetime): A string describing the datetime format, as defined by Python's strftime directives.
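As an illustration of a datetime format string, the sketch below assumes the format follows Python's `strftime`/`strptime` directives and round-trips a sample value with the standard library. The format and the sample timestamp are made up for the example.

```python
from datetime import datetime

# Assumed format: four-digit year, month, day, then 24-hour time.
datetime_format = "%Y-%m-%d %H:%M:%S"

# Parse a string stored in that format, then render it back unchanged.
parsed = datetime.strptime("2023-01-15 08:30:00", datetime_format)
round_trip = parsed.strftime(datetime_format)

print(parsed.year, parsed.month, parsed.day)
print(round_trip)
```

If the format string does not match how the column is actually stored, `strptime` raises a `ValueError`, which is a quick way to sanity-check a format before using it.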
regex_format (sdtype id): A string describing the format of the ID as a regular expression.
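For the ID structure mentioned earlier (exactly 8 digits with a - in the middle), one possible pattern is sketched below using Python's `re` module. The pattern assumes the dash splits the digits 4 and 4, and `matches_id_format` is a hypothetical helper for checking values against it.

```python
import re

# Assumed structure: four digits, a dash, four digits.
regex_format = r"\d{4}-\d{4}"


def matches_id_format(value):
    """Return True if the whole string matches the assumed ID pattern."""
    return re.fullmatch(regex_format, value) is not None


print(matches_id_format("1234-5678"))
print(matches_id_format("12345678"))
```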
* Licensed enterprise users will see higher quality data. The SDV will extract the deeper meaning and replicate it in the synthetic data.
Many other sdtypes are possible. The SDV models can use the Python Faker library for new data types. You can input any of the Faker function names as sdtypes. For example, inputting the sdtype passport_number will use the Faker function of the same name to generate meaningful numbers.
Full SDV support for other data types is prioritized based on user requests.
These features are available to licensed users.