The RDT library is a collection of objects that can understand your raw data and convert it into cleaned, numerical data.
Transformers are the basic building blocks. They are designed to modify a single column of your dataset. All transformers can also be reversed.
Transformers are designed to work on specific types of data using different techniques. You can determine which strategies to use for your data, including handling missing values.
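To make the idea of a reversible, single-column transformer concrete, here is a minimal sketch in plain Python. It is not RDT's implementation: the class name `ToyLabelEncoder`, its methods, and the `missing_code` parameter are hypothetical and chosen only for illustration. It assumes missing values are represented as `None` and shows one possible strategy for handling them (mapping them to a reserved code).

```python
class ToyLabelEncoder:
    """Hypothetical single-column transformer: categories <-> integer codes."""

    def fit(self, column):
        # Learn one integer code per distinct non-missing value, in first-seen order.
        self.codes = {}
        for value in column:
            if value is not None and value not in self.codes:
                self.codes[value] = len(self.codes)
        # Keep the inverse mapping so the transformation can be reversed.
        self.values = {code: value for value, code in self.codes.items()}
        return self

    def transform(self, column, missing_code=-1):
        # Missing values (None) are mapped to a reserved numeric code.
        return [missing_code if v is None else self.codes[v] for v in column]

    def reverse_transform(self, encoded, missing_code=-1):
        # Map codes back to the original categories, restoring missing values.
        return [None if c == missing_code else self.values[c] for c in encoded]


column = ['cash', 'card', None, 'cash']
encoder = ToyLabelEncoder().fit(column)
encoded = encoder.transform(column)
restored = encoder.reverse_transform(encoded)
```

The point of the sketch is the round trip: `encoded` contains only numbers, and `restored` matches the original column, including the missing value.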
The HyperTransformer manages all the transformers you need for an entire, multi-column dataset. You can mix and match your favorite transformers on different columns of your data.
You can also reverse the process to recover the original data format.
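The role of a multi-column manager can be sketched in the same spirit. The code below is a plain-Python illustration, not the HyperTransformer API: the `plan` dictionary, the `apply_plan` helper, and the column names are all hypothetical. It assumes each column is paired with a forward function and a reverse function, and that a column mapped to `None` is left untouched.

```python
# Hypothetical plan: each column maps to a (transform, reverse) pair,
# or to None when the column should be passed through unchanged.
plan = {
    'age': (float, int),
    'subscribed': (lambda v: 1.0 if v else 0.0, lambda v: v >= 0.5),
    'note': None,
}


def apply_plan(rows, plan, direction):
    """Apply the forward or reverse function of every column in the plan."""
    index = 0 if direction == 'forward' else 1
    result = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            pair = plan.get(column)
            # Columns without a pair are copied as-is.
            new_row[column] = value if pair is None else pair[index](value)
        result.append(new_row)
    return result


rows = [
    {'age': 31, 'subscribed': True, 'note': 'first order'},
    {'age': 54, 'subscribed': False, 'note': 'returning'},
]
transformed = apply_plan(rows, plan, 'forward')
recovered = apply_plan(transformed, plan, 'reverse')
```

Here `transformed` holds numerical values for the planned columns, and `recovered` is back in the original format.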
The RDT library uses sdtypes to keep track of what each column in your data represents. You can think of an sdtype as representing the semantic (or statistical) meaning of a datatype.
The valid sdtypes in the public RDT library are:
'text'. More are available to licensed, Enterprise users.
An sdtype is a high-level concept that does not depend on how a computer stores the data. A single sdtype (such as 'categorical') can be stored by a computer in several ways (text, integer, etc.).
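As a small illustration of the difference between an sdtype and a storage type, the sketch below uses plain Python with made-up column names and values. Both columns are assumed to be 'categorical' in meaning, yet one is stored as text and the other as integers.

```python
# Hypothetical columns: both represent discrete categories.
data = {
    'payment_method': ['cash', 'card', 'cash'],  # stored as text
    'store_id': [12, 7, 12],                     # stored as integers
}

# The sdtype records what the column means, not how it is stored.
sdtypes = {
    'payment_method': 'categorical',
    'store_id': 'categorical',
}

# The storage type, by contrast, comes from the values themselves.
storage_types = {column: type(values[0]).__name__ for column, values in data.items()}
```

The two columns share an sdtype while `storage_types` reports `str` for one and `int` for the other.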
The config describes the plan for transforming all the columns in a dataset. It describes the columns in your dataset, their sdtypes and the transformer that will be applied to each one.
'credit_card': None, # do not do anything with this column
'age': None, # do not do anything with this column
In the example above, each column is assigned a transformer based on its sdtype. Columns assigned None have no transformer, so their data will not be transformed.
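A config of this shape can be sketched as a plain Python dictionary. The sketch assumes a layout with one mapping from column names to sdtypes and one from column names to transformers. The sdtypes shown, the extra columns `signup_date` and `plan`, and the transformer names are assumptions for illustration; the names are written as placeholder strings so the snippet runs without the library installed, whereas a real config would hold transformer objects.

```python
# Illustrative config. Column sdtypes and transformer names are assumptions.
config = {
    'sdtypes': {
        'credit_card': 'pii',
        'age': 'numerical',
        'signup_date': 'datetime',   # hypothetical column
        'plan': 'categorical',       # hypothetical column
    },
    'transformers': {
        'credit_card': None,  # do not do anything with this column
        'age': None,          # do not do anything with this column
        'signup_date': 'UnixTimestampEncoder',  # placeholder string, not an object
        'plan': 'LabelEncoder',                 # placeholder string, not an object
    },
}

# Every column should appear in both parts of the config.
assert config['sdtypes'].keys() == config['transformers'].keys()

# Columns mapped to None are passed through without transformation.
untransformed = sorted(
    column for column, transformer in config['transformers'].items()
    if transformer is None
)
```

Listing the `None` entries, as `untransformed` does, is a quick way to check which columns the plan leaves as-is.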