Sdtypes

The RDT library uses sdtypes to keep track of what the data represents. You can think of an sdtype as representing the semantic (or statistical) meaning of a datatype.
The open source RDT supports the following sdtypes: 'boolean', 'categorical', 'datetime', 'numerical', 'pii' and 'text'.
An sdtype is a high-level concept that does not depend on how a computer stores the data. A single sdtype (such as 'categorical') can be stored by a computer in several ways (text, integer, etc.).
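To illustrate the difference between meaning and storage, here is a minimal sketch in plain Python. The column names and the dictionary layout are hypothetical and chosen for illustration only; they are not RDT's own configuration format.

```python
# Hypothetical example: the same categorical data stored in two different ways.
card_as_text = ["VISA", "AMEX", "VISA", "DISCOVER"]
card_as_int = [0, 1, 0, 2]  # integer codes standing for the same categories

# The sdtype describes what the data means, not how it is stored,
# so both columns are assigned the same sdtype.
sdtypes = {
    "card_as_text": "categorical",
    "card_as_int": "categorical",
}

# The sdtypes listed for the open source RDT.
VALID_SDTYPES = {"boolean", "categorical", "datetime", "numerical", "pii", "text"}
all_valid = all(sdtype in VALID_SDTYPES for sdtype in sdtypes.values())

# The storage types differ even though the sdtype is the same.
storage_types = {type(card_as_text[0]).__name__, type(card_as_int[0]).__name__}
```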

Boolean

Sdtype 'boolean' describes columns that contain TRUE or FALSE values and may contain some missing data.
For example, you may be recording whether users have opted into receiving your marketing emails.

Categorical

Sdtype 'categorical' describes columns that contain distinct categories. The defining aspect of a categorical column is that only the distinct values that appear in the data are valid.
The categories may be ordered or unordered. Ordered categories are known as ordinal while unordered categories are known as nominal.
For example, you may be recording the credit card company of your users. This can only take on specific values like "VISA", "AMEX" or "DISCOVER". This is nominal because the categories don't have any order.
Always double check to see if the categorical columns are what you expect. Categorical data might be easily confused with other sdtypes such as numerical.
For example, HTTP response codes such as 200, 404, etc. are actually stored as integers but they are categorical data. The codes represent distinct categories and they cannot be combined or averaged.
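The following sketch shows why arithmetic on such a column is misleading. It uses only the Python standard library and a made-up list of response codes.

```python
from collections import Counter
from statistics import mean

# Made-up sample of HTTP response codes, stored as integers.
status_codes = [200, 200, 404, 500, 200, 404]

# Treated as numbers, the codes can be averaged, but the result
# is not one of the categories that appear in the data.
average_code = mean(status_codes)
valid_categories = set(status_codes)
average_is_a_category = average_code in valid_categories

# Treated as categories, the meaningful summary is a frequency count.
code_counts = Counter(status_codes)
```

Counting occurrences keeps every result inside the set of values that actually appear, which is the defining property of a categorical column.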

Datetime

Sdtype 'datetime' describes columns that indicate a point in time. This can be at any granularity: to the nearest day, minute, second or even nanosecond. Typically, the datetime will be represented as a string.
For example, you might be storing the last day each user logged into your site.
Always double check to see if the datetime columns are being detected. A datetime column might be incorrectly detected as categorical if it's in a non-standard format.
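As an illustration of why the format matters, the sketch below parses datetime strings with Python's standard library. It is not RDT's detection logic; the helper name parse_all and the sample values are assumptions made for this example.

```python
from datetime import datetime

# Made-up sample values: one column in ISO format, one in a non-standard format.
last_login = ["2023-01-15", "2023-02-03"]
nonstandard = ["15|01|2023", "03|02|2023"]

def parse_all(values, fmt):
    """Return the parsed datetimes, or None if any value does not match fmt."""
    try:
        return [datetime.strptime(value, fmt) for value in values]
    except ValueError:
        return None

iso_result = parse_all(last_login, "%Y-%m-%d")
wrong_format_result = parse_all(nonstandard, "%Y-%m-%d")   # format does not match
right_format_result = parse_all(nonstandard, "%d|%m|%Y")   # format stated explicitly
```

When the expected format does not match, the strings cannot be read as datetimes, and the column looks like a set of distinct text values instead.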

Numerical

Sdtype 'numerical' describes data with numbers. The defining aspect of numerical data is that there is an order and you can apply a variety of mathematical computations to the values (average, sum, etc.). The actual values may follow a specific format, such as being rounded to 2 decimal digits and remaining between min/max bounds.
For example, you might be storing product purchase amounts (USD) with 2 decimal digits. You might be storing the ages of your customers as whole numbers that must be 18 or above.
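The rounding and bounds described above can be checked with a small helper. This is a sketch in plain Python; the function name fits_format and the sample values are hypothetical.

```python
def fits_format(values, decimals, minimum=None, maximum=None):
    """Check that every value is rounded to `decimals` digits and within bounds."""
    for value in values:
        if round(value, decimals) != value:
            return False
        if minimum is not None and value < minimum:
            return False
        if maximum is not None and value > maximum:
            return False
    return True

# Made-up purchase amounts (USD) with at most 2 decimal digits.
amounts_ok = fits_format([19.99, 5.5, 120.0], decimals=2, minimum=0)
amounts_bad = fits_format([19.99, 12.345], decimals=2, minimum=0)

# Made-up customer ages: whole numbers that must be 18 or above.
ages_ok = fits_format([18, 25, 67], decimals=0, minimum=18)
ages_bad = fits_format([17, 30], decimals=0, minimum=18)
```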

PII

Sdtype 'pii' stands for Personally Identifiable Information. The defining aspect of PII data is that you do not want the real values in your dataset to leak. Typically, PII has a higher-level semantic meaning, and if you know that meaning it may be possible to create entirely new values from scratch.
For example, names, phone numbers or addresses are all sensitive data that you do not want to leak.
PII columns are not automatically detected. Please verify your data and manually set any sensitive columns to PII.
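The idea of creating new values from scratch can be sketched as follows. This is an illustrative example in plain Python, not how RDT generates PII; the helper make_phone, the phone number format and the sample values are all assumptions.

```python
import random
import re

# Made-up stand-ins for sensitive values that must not leak.
real_phones = ["415-867-5309", "212-736-5000"]

PHONE_FORMAT = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

def make_phone(rng):
    """Create a new phone number from scratch, using only the known format.

    Numbers of the form 555-01XX are reserved for fictional use in
    North America, so the result cannot collide with the samples above.
    """
    area_code = rng.randint(200, 999)
    line = rng.randint(0, 99)
    return f"{area_code}-555-01{line:02d}"

rng = random.Random(0)  # seeded so the sketch is repeatable
synthetic_phones = [make_phone(rng) for _ in real_phones]

same_format = all(PHONE_FORMAT.fullmatch(phone) for phone in synthetic_phones)
leaked = set(synthetic_phones) & set(real_phones)
```

The new values follow the same format as the originals, but none of the real values are reused.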

Text

Sdtype 'text' describes columns with generic text. Text does not have any mathematical meaning or privacy implications. Generally, you'll use sdtype 'text' to describe a column of structured text, for example surrogate keys that are used to identify rows.
For example, the primary key column user_id is a text column that can be used to identify each row. It has a specific format: 'ID_' followed by a 3-digit code.
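A format like this can be written as a regular expression. The sketch below uses Python's standard re module; the pattern and the helper name matches_id_format are assumptions based on the format described above.

```python
import re

# 'ID_' followed by exactly 3 digits.
ID_PATTERN = re.compile(r"ID_[0-9]{3}")

def matches_id_format(value):
    """Return True if the whole value follows the 'ID_' + 3 digits format."""
    return ID_PATTERN.fullmatch(value) is not None

# Made-up user_id values.
user_ids = ["ID_001", "ID_042", "ID_999"]
all_match = all(matches_id_format(user_id) for user_id in user_ids)

# Primary keys must also be unique so that each value identifies one row.
all_unique = len(set(user_ids)) == len(user_ids)
```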
Text columns are not automatically detected. Please verify your data and manually set any text columns.
Note: RDT does not currently provide support for natural language text.

Other sdtypes

You can purchase Premium Add-Ons for access to sdtypes within a specific context. For example, the sdtype 'phone_number' understands the specific meaning behind a phone number.