Config

In RDT, a config describes the columns in your dataset and the transformers that will be applied to each one.

Overall format

The config is a dictionary that must follow the format shown below.
{
    'sdtypes': {
        'column_name': <sdtype>,
        'column_name': <sdtype>,
        ...
    },
    'transformers': {
        'column_name': <transformer object>,
        'column_name': <transformer object>,
        ...
    }
}
The config is represented as a nested dictionary:
  • sdtypes is a dictionary that maps each column name to its semantic data type (sdtype), such as 'boolean', 'categorical', 'datetime', 'numerical' or 'pii'. See the Sdtypes usage guide for more info.
  • transformers is a dictionary that maps the same column names to the transformer object that will be used for each column. The Transformers Glossary contains a full list of available transformers and their settings. You can create a transformer object and assign it to a column name.
Transformers are only compatible with specific sdtypes. To avoid any errors, make sure the transformer you are specifying is compatible with the given sdtype.
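Before handing a config to RDT, it can help to check its shape. The following is a minimal sketch of such a check, assuming that both dictionaries should list the same column names, as described above. The helper `find_config_problems` and the `KNOWN_SDTYPES` set are hypothetical and not part of RDT. The set contains only the sdtypes named above, so extend it if your version of RDT supports others. The sketch does not check whether a transformer is compatible with its sdtype.

```python
# Hypothetical helper (not part of RDT): report structural problems in a config.
# Assumption: only the sdtypes named in this guide are listed here.
KNOWN_SDTYPES = {'boolean', 'categorical', 'datetime', 'numerical', 'pii'}


def find_config_problems(config):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for key in ('sdtypes', 'transformers'):
        if not isinstance(config.get(key), dict):
            problems.append(f"missing or non-dictionary key: '{key}'")
    if problems:
        return problems

    sdtypes, transformers = config['sdtypes'], config['transformers']
    # Both dictionaries are expected to cover the same column names.
    for column in sorted(set(sdtypes) - set(transformers)):
        problems.append(f"column '{column}' has an sdtype but no transformer entry")
    for column in sorted(set(transformers) - set(sdtypes)):
        problems.append(f"column '{column}' has a transformer entry but no sdtype")
    for column, sdtype in sdtypes.items():
        if sdtype not in KNOWN_SDTYPES:
            problems.append(f"column '{column}' has unrecognized sdtype '{sdtype}'")
    return problems


example = {
    'sdtypes': {'age': 'numerical', 'email_optin': 'boolean'},
    # A string stands in for a transformer object so the sketch runs without RDT.
    'transformers': {'age': '<FloatFormatter>'},
}
print(find_config_problems(example))
```

Here the helper reports that 'email_optin' has an sdtype but no transformer entry. Returning a list instead of raising on the first problem lets you see every mismatch in one pass.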

Example

Below is an example of a demo dataset. This dataset contains some randomly generated values that describe the customers of an online marketplace.
Here is a config that you can use to transform this dataset:
from rdt.transformers import (
    BinaryEncoder,
    FloatFormatter,
    FrequencyEncoder,
    UnixTimestampEncoder,
)

{
    'sdtypes': {
        'last_login': 'datetime',
        'email_optin': 'boolean',
        'credit_card': 'categorical',
        'age': 'numerical',
        'dollars_spent': 'numerical'
    },
    'transformers': {
        'last_login': UnixTimestampEncoder(missing_value_replacement="mean"),
        'email_optin': BinaryEncoder(missing_value_replacement="mode"),
        'credit_card': FrequencyEncoder(),
        'age': FloatFormatter(missing_value_replacement="mean"),
        'dollars_spent': FloatFormatter(missing_value_replacement="mean")
    }
}

Skipping columns

Sometimes you may not want to transform certain columns in your dataset. For these columns, set the transformer to None.
When you do this, the HyperTransformer skips those columns and carries them over as-is when asked to transform.
{
    'sdtypes': {
        'last_login': 'datetime',
        'email_optin': 'boolean',
        'credit_card': 'categorical',
        'age': 'numerical',
        'dollars_spent': 'numerical'
    },
    'transformers': {
        'last_login': UnixTimestampEncoder(missing_value_replacement="mean"),
        'email_optin': BinaryEncoder(missing_value_replacement="mode"),
        'credit_card': None,  # do not do anything with this column
        'age': None,  # do not do anything with this column
        'dollars_spent': FloatFormatter(missing_value_replacement="mean")
    }
}
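The config above marks two columns as skipped by hand. As a minimal sketch, a small helper can do the same for any list of columns. The name `skip_columns` is hypothetical and not an RDT function, and the strings below stand in for transformer objects so the example runs without RDT installed.

```python
def skip_columns(config, columns):
    """Return a copy of the config with the given columns' transformers set to None.

    Hypothetical helper, not part of RDT. The sdtypes are left unchanged.
    """
    unknown = [c for c in columns if c not in config['transformers']]
    if unknown:
        raise KeyError(f"columns not in config: {unknown}")

    # Copy both dictionaries so the original config is not modified.
    transformers = dict(config['transformers'])
    for column in columns:
        transformers[column] = None
    return {'sdtypes': dict(config['sdtypes']), 'transformers': transformers}


config = {
    'sdtypes': {
        'credit_card': 'categorical',
        'age': 'numerical',
        'dollars_spent': 'numerical',
    },
    # Placeholder strings stand in for real transformer objects.
    'transformers': {
        'credit_card': '<FrequencyEncoder>',
        'age': '<FloatFormatter>',
        'dollars_spent': '<FloatFormatter>',
    },
}

updated = skip_columns(config, ['credit_card', 'age'])
skipped = [name for name, t in updated['transformers'].items() if t is None]
print(skipped)
```

Returning a new dictionary rather than editing in place keeps the original config available, which is convenient if you want to compare the two.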