After you have installed RDT, you can get started using the demo dataset.
from rdt import get_demo
customers = get_demo()
This dataset contains randomly generated values that describe the customers of an online marketplace.
- last_login is the most recent day the user logged into the website
- email_optin describes whether the user has opted in to receiving marketing emails
- credit_card is the name of the primary credit card on file
- age is the user's self-reported age
- dollars_spent is the cumulative total USD the user has spent
Some of these values may be missing for certain users.
Let's transform this data so that each column is converted to full, numerical data ready for data science.
The HyperTransformer manages all the transformers you need for an entire, multi-column dataset. You can mix and match your favorite transformers on different columns of your data.
Let's start by creating a HyperTransformer object.
from rdt import HyperTransformer
ht = HyperTransformer()
The config describes the plan for transforming all the columns in a dataset. It describes the columns in your dataset and the transformers that will be applied to each one.
You can ask the HyperTransformer to automatically detect it based on the data you plan to use.
This will create and set the config.
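The detection step can be sketched as a small helper. This is a minimal sketch, assuming the HyperTransformer exposes `detect_initial_config(data=...)` and `get_config()` as in RDT 1.x; the helper name `detect_and_get_config` is hypothetical and not part of RDT.

```python
def detect_and_get_config(ht, data):
    """Detect a config from `data`, then return it for inspection.

    Assumes `ht` provides detect_initial_config(data=...) and get_config(),
    as the RDT 1.x HyperTransformer is expected to.
    """
    ht.detect_initial_config(data=data)  # creates and sets the config on `ht`
    return ht.get_config()               # the config that was just detected
```

With the objects created earlier on this page, `config = detect_and_get_config(ht, customers)` would return the detected config so you can review it before fitting.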
The sdtypes dictionary describes the semantic data types of each of your columns, and the transformers dictionary describes which transformer to use for each column.
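As a rough illustration of that two-part shape, the config for the demo data might look like the plain dictionary below. The sdtype assigned to each column is an assumption inferred from the column descriptions above, and the transformer entries are placeholder strings, not the transformers RDT actually selects.

```python
# Illustrative shape only. The sdtypes are inferred from the column
# descriptions, and the transformer values are placeholders rather than
# the real transformer objects RDT would assign.
example_config = {
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical",
    },
    "transformers": {
        "last_login": "<transformer for datetime data>",
        "email_optin": "<transformer for boolean data>",
        "credit_card": "<transformer for categorical data>",
        "age": "<transformer for numerical data>",
        "dollars_spent": "<transformer for numerical data>",
    },
}

# Both dictionaries are keyed by the same column names.
same_columns = example_config["sdtypes"].keys() == example_config["transformers"].keys()
print(same_columns)
```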
To customize the transformer, you can modify any part of the config. You can update the sdtypes if they are wrong, swap out different transformers or update the transformer settings. (See the HyperTransformer Usage Guide for more details.)
Let's update some of the transformers. Start by creating the transformer objects that you want to use instead. The Transformers Glossary contains a list of all the available transformers and their settings.
# import and create new transformer objects
from rdt.transformers.datetime import OptimizedTimestampEncoder
from rdt.transformers.categorical import FrequencyEncoder
login_transformer = OptimizedTimestampEncoder(missing_value_replacement='mean')
credit_transformer = FrequencyEncoder(add_noise=True)
Now you can update the config to use the new transformers.
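One way to apply the swap is sketched below. It assumes the HyperTransformer provides `update_transformers(column_name_to_transformer=...)` and a dict-like `get_config()`, as in RDT 1.x; the helper name `swap_transformers` is hypothetical.

```python
def swap_transformers(ht, column_name_to_transformer):
    """Assign new transformers to columns and return the updated mapping.

    Assumes `ht` provides update_transformers(column_name_to_transformer=...)
    and that get_config() returns a dict-like object with a 'transformers' key.
    """
    ht.update_transformers(column_name_to_transformer=column_name_to_transformer)
    return ht.get_config()['transformers']
```

For the transformers created above, a plausible call is `swap_transformers(ht, {'last_login': login_transformer, 'credit_card': credit_transformer})`. Pairing each transformer with those columns is an assumption based on the variable names.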
The changes are now visible in the config.
When you are satisfied with the config, you can begin to use the HyperTransformer. The first step is to fit the HyperTransformer to the data using the fit method.
For large datasets, this step may take some time. To avoid any errors, it's important to make sure that the data matches the config.
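A simple pre-check can catch a config and data mismatch before fitting. This is a minimal sketch, assuming `get_config()` returns a dict-like object with an `sdtypes` key, that `fit(data)` accepts a pandas DataFrame, and that comparing column names is a sufficient check for your case. The helper name `fit_if_columns_match` is hypothetical.

```python
def fit_if_columns_match(ht, data):
    """Fit `ht` on `data` only when the config and the data list the same columns."""
    config_columns = set(ht.get_config()['sdtypes'])
    data_columns = set(data.columns)
    if config_columns != data_columns:
        missing = sorted(config_columns - data_columns)
        extra = sorted(data_columns - config_columns)
        raise ValueError(
            f'Config and data do not match. Missing from data: {missing}; '
            f'not in config: {extra}'
        )
    ht.fit(data)  # assumed RDT 1.x method; may be slow on large datasets
    return ht
```

Calling `fit_if_columns_match(ht, customers)` fits the HyperTransformer as usual, but raises a descriptive error first if the column names differ. It does not verify that each column's values agree with its sdtype.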
After it's fit, you can begin to use the transformer. The transform method will return cleaned, numerical data that's ready for data science.
transformed_customers = ht.transform(customers)
The HyperTransformer applied the assigned transformer to each individual column. Each column now contains fully numerical data that you can use for your project!
You can use the reverse_transform method to get back data in the original format with the original column names. Note that this data may not be exactly the same as the original data, depending on the transformers and their settings.
reversed_customers = ht.reverse_transform(transformed_customers)
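To see why a round trip may not reproduce the original exactly, consider the toy example below. It is not RDT's implementation: `toy_transform` and `toy_reverse_transform` are hypothetical functions showing one way information can be lost, here by filling missing values with the mean without recording where they were.

```python
from statistics import mean

def toy_transform(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def toy_reverse_transform(values):
    """Nothing recorded which entries were filled, so they cannot be restored."""
    return list(values)

ages = [30, None, 50]
round_trip = toy_reverse_transform(toy_transform(ages))

print(round_trip)
print(round_trip == ages)  # False: the missing entry came back as a number
```

Whether a real transformer restores missing values or other details depends on the transformer and its settings, so it is worth comparing `reversed_customers` with `customers` for the columns you care about.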
Use the rest of this documentation to dive into more details.
- HyperTransformer: Learn about the advanced usage of the HyperTransformer, including the ability to change the transformers.
- Sdtypes: RDT recognizes boolean, categorical, datetime, numerical and PII data. Identifying the correct sdtype is critical to choosing the right transformation.
- Transformers Glossary: Browse through the many transformers you can use for your data and the settings that are available for each.
To join the discussion and connect with other users, join the SDV Slack through the link below.