Quickstart
After you have installed RDT, you can get started using the demo dataset.
from rdt import get_demo
customers = get_demo()
This dataset contains randomly generated values that describe the customers of an online marketplace.
Let's transform this data so that every column is converted to fully numerical data that is ready for data science.

Using the HyperTransformer

The HyperTransformer manages all the transformers you need for an entire, multi-column dataset. You can mix and match your favorite transformers on different columns of your data.
Let's start by creating a HyperTransformer object.
from rdt import HyperTransformer
ht = HyperTransformer()

Creating the config

The config describes the plan for transforming all the columns in a dataset. It describes the columns in your dataset and the transformers that will be applied to each one.
You can ask the HyperTransformer to automatically detect it based on the data you plan to use.
ht.detect_initial_config(data=customers)
This will create and set the config.
Config:
{
    'sdtypes': {
        'last_login': 'datetime',
        'email_optin': 'boolean',
        'credit_card': 'categorical',
        'age': 'numerical',
        'dollars_spent': 'numerical'
    },
    'transformers': {
        'last_login': UnixTimestampEncoder(missing_value_replacement="mean"),
        'email_optin': BinaryEncoder(missing_value_replacement="mode"),
        'credit_card': FrequencyEncoder(),
        'age': FloatFormatter(missing_value_replacement="mean"),
        'dollars_spent': FloatFormatter(missing_value_replacement="mean")
    }
}
The sdtypes dictionary describes the semantic data types of each of your columns and the transformers dictionary describes which transformer to use for each column.
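To make the shape of the config concrete, here is a minimal sketch that builds a config-like dictionary from a small table. It is not how RDT detects sdtypes: `guess_sdtype` and `build_config` are hypothetical helpers, the type checks are simplistic assumptions, and the sample values are made up. Only the mapping from sdtype to default transformer name is taken from the config printed above.

```python
from datetime import datetime

# Default transformer names per sdtype, copied from the config printed above.
DEFAULT_TRANSFORMERS = {
    "datetime": "UnixTimestampEncoder",
    "boolean": "BinaryEncoder",
    "categorical": "FrequencyEncoder",
    "numerical": "FloatFormatter",
}


def guess_sdtype(values):
    """Hypothetical heuristic: guess an sdtype from the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"  # assumption: fall back when nothing is known
    # bool must be checked before numbers because bool is a subclass of int
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in present):
        return "numerical"
    if all(isinstance(v, datetime) for v in present):
        return "datetime"
    return "categorical"


def build_config(table):
    """Build a config-shaped dict from a {column_name: list_of_values} table."""
    sdtypes = {name: guess_sdtype(values) for name, values in table.items()}
    transformers = {
        name: DEFAULT_TRANSFORMERS[sdtype] for name, sdtype in sdtypes.items()
    }
    return {"sdtypes": sdtypes, "transformers": transformers}


# Made-up values, loosely modeled on some of the demo columns
sample = {
    "last_login": [datetime(2021, 6, 26), None],
    "email_optin": [False, True],
    "credit_card": ["VISA", "AMEX"],
    "age": [29, 18],
}
config = build_config(sample)
print(config["sdtypes"])
```

Both dictionaries share the same keys, one per column, which is the property the real config also has.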

Modifying the config

To customize the transformer, you can modify any part of the config. You can update the sdtypes if they are wrong, swap out different transformers or update the transformer settings. (See the HyperTransformer Usage Guide for more details.)
Let's update some of the transformers. Start by creating the transformer objects that you want to use instead. The Transformers Glossary contains a list of all the available transformers and their settings.
# import and create new transformer objects
from rdt.transformers.datetime import OptimizedTimestampEncoder
from rdt.transformers.categorical import FrequencyEncoder

login_transformer = OptimizedTimestampEncoder(missing_value_replacement='mean')
credit_transformer = FrequencyEncoder(add_noise=True)
Now you can update the config to use the new transformers.
ht.update_transformers(column_name_to_transformer={
    'last_login': login_transformer,
    'credit_card': credit_transformer
})
The changes are now visible in the config.
ht.get_config()
{
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical"
    },
    "transformers": {
        "last_login": OptimizedTimestampEncoder(missing_value_replacement="mean"),
        "email_optin": BinaryEncoder(missing_value_replacement="mode"),
        "credit_card": FrequencyEncoder(add_noise=True),
        "age": FloatFormatter(missing_value_replacement="mean"),
        "dollars_spent": FloatFormatter(missing_value_replacement="mean")
    }
}
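For intuition about what a frequency-based encoder with noise might do to a categorical column such as credit_card, the sketch below assigns each category a slice of the range [0, 1] sized by how often it appears. This is a conceptual illustration of the general technique only, not RDT's implementation: the function names are hypothetical, the card values are made up, and the uniform noise is an assumption (the distribution RDT uses may differ).

```python
import random
from collections import Counter


def fit_intervals(values):
    """Give each category a slice of [0, 1] proportional to its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals = {}
    start = 0.0
    # most_common() orders categories from most to least frequent
    for category, count in counts.most_common():
        end = start + count / total
        intervals[category] = (start, end)
        start = end
    return intervals


def encode(value, intervals, add_noise=False, rng=None):
    """Map a category to a number inside its interval."""
    start, end = intervals[value]
    if add_noise:
        # Assumption: uniform noise within the category's interval
        rng = random if rng is None else rng
        return rng.uniform(start, end)
    # Without noise, every occurrence maps to the interval's midpoint
    return (start + end) / 2


def decode(number, intervals):
    """Map a number back to the category whose interval contains it."""
    for category, (start, end) in intervals.items():
        if start <= number <= end:
            return category
    raise ValueError(f"{number} is outside every interval")


cards = ["VISA", "VISA", "AMEX", "VISA", "DISCOVER"]  # made-up values
intervals = fit_intervals(cards)

rng = random.Random(0)  # seeded so the run is repeatable
noisy = [encode(card, intervals, add_noise=True, rng=rng) for card in cards]
decoded = [decode(number, intervals) for number in noisy]
print(decoded == cards)
```

In this sketch the noise spreads repeated categories over distinct numbers while each number still decodes to its original category.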

Transforming the data

When you are satisfied with the config, you can begin to use the HyperTransformer. The first step is to process the data using fit.
For large datasets, this step may take some time. To avoid any errors, it's important to make sure that the data matches the config.
ht.fit(customers)
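One way to check that the data matches the config before fitting is to compare column names. The sketch below is a hypothetical helper, not part of RDT; it works on a plain dictionary shaped like the config shown earlier and a list of column names, and the extra zip_code column is invented for the example.

```python
def check_columns(config, data_columns):
    """Hypothetical helper: compare the config's columns with the data's columns."""
    expected = set(config["sdtypes"])
    actual = set(data_columns)
    return {
        "missing_from_data": sorted(expected - actual),
        "missing_from_config": sorted(actual - expected),
    }


# Plain-dict stand-in for the sdtypes part of the config shown earlier
config = {
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical",
    }
}

# 'zip_code' is an invented column used to trigger a mismatch
columns = ["last_login", "email_optin", "credit_card", "age", "zip_code"]
report = check_columns(config, columns)
print(report)
```

If the object returned by ht.get_config() supports dictionary-style access, as its printed form suggests, the same check could presumably be run on it together with the column names of your data. That is an assumption rather than documented behavior.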
After it's fit, you can begin to use the transformer. The transform method will return cleaned, numerical data that's ready for data science.
transformed_customers = ht.transform(customers)
The HyperTransformer applied the assigned transformer to each individual column. Each column now contains fully numerical data that you can use for your project!
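As a rough picture of what "fully numerical" means for non-numeric columns, the sketch below converts a datetime and a boolean into floats using only the Python standard library. It is a conceptual stand-in, not the code behind RDT's encoders: the function names are hypothetical, the login value is made up, and the unit (seconds since the Unix epoch) is an assumption that may not match what RDT produces.

```python
from datetime import datetime, timezone


def datetime_to_number(value):
    """Seconds since the Unix epoch (assumed unit for this sketch)."""
    return value.timestamp()


def number_to_datetime(number):
    """Inverse of datetime_to_number, always returning a UTC datetime."""
    return datetime.fromtimestamp(number, tz=timezone.utc)


def boolean_to_number(value):
    """Represent True as 1.0 and False as 0.0."""
    return 1.0 if value else 0.0


login = datetime(2021, 6, 26, tzinfo=timezone.utc)  # made-up value
as_number = datetime_to_number(login)
print(as_number, boolean_to_number(False))
print(number_to_datetime(as_number) == login)
```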
You can use the reverse_transform method to get the data back in the original format with the original column names. Note that this data may not be exactly the same as the original data, depending on the transformers and their settings.
reversed_customers = ht.reverse_transform(transformed_customers)
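To see why reversed data can differ from the original, consider a toy transformer that follows the same fit, transform, reverse_transform pattern and replaces missing values with the mean. This is a simplified sketch under stated assumptions, not RDT's FloatFormatter: the class is hypothetical, the ages are made up, and RDT's transformers have their own settings that affect how missing values are handled.

```python
class MeanFillTransformer:
    """Toy transformer: learn the mean in fit, fill missing values in transform."""

    def fit(self, values):
        # Learn the replacement value from the non-missing entries
        present = [v for v in values if v is not None]
        self.mean = sum(present) / len(present)

    def transform(self, values):
        # Replace each missing entry with the learned mean and cast to float
        return [float(self.mean if v is None else v) for v in values]

    def reverse_transform(self, values):
        # The transformed data no longer records which entries were missing,
        # so filled-in values come back as the mean rather than as None.
        return list(values)


ages = [29, None, 18, 43]  # made-up values with one missing entry
transformer = MeanFillTransformer()
transformer.fit(ages)
transformed = transformer.transform(ages)
reversed_ages = transformer.reverse_transform(transformed)
print(transformed)
print(reversed_ages == ages)
```

In this toy version the missing entry is not restored, which is one example of reversed data not matching the original exactly.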

What's Next?

Use the rest of this documentation to dive into more details.
  • HyperTransformer: Learn about the advanced usage of the HyperTransformer, including the ability to change the transformers.
  • Sdtypes: RDT recognizes boolean, categorical, datetime, numerical and PII data. Identifying the correct sdtype is critical to choosing the right transformation.
  • Transformers Glossary: Browse through the many transformers you can use for your data and the settings that are available for each.
For more discussion and to connect with other users, join the SDV Slack.