Basic Concepts

The RDT library is a collection of objects that can understand your raw data and convert it into cleaned, numerical data.

Transformers

Transformers are the basic building blocks. They are designed to modify a single column of your dataset. All transformers can also be reversed.

Transformers are designed to work on specific types of data using different techniques. You can choose which strategies to use for your data, including how missing values are handled.
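To make the idea of a reversible, single-column transformer concrete, here is a minimal sketch in plain Python. The `ToyLabelEncoder` class, its `-1` sentinel for missing values and the `card_type` column are all hypothetical and chosen only for illustration; this is not how RDT implements its transformers.

```python
class ToyLabelEncoder:
    """Illustrative only: maps each category in one column to an integer."""

    def fit(self, column):
        # Learn one integer code per distinct non-missing value,
        # in order of first appearance.
        self.codes = {}
        for value in column:
            if value is not None and value not in self.codes:
                self.codes[value] = len(self.codes)
        self.labels = {code: value for value, code in self.codes.items()}
        return self

    def transform(self, column):
        # Assumed strategy: missing values (None) get the sentinel code -1.
        return [-1 if value is None else self.codes[value] for value in column]

    def reverse_transform(self, encoded):
        # Map codes back to labels; the sentinel becomes None again.
        return [None if code == -1 else self.labels[code] for code in encoded]


card_type = ["visa", "amex", None, "visa"]
encoder = ToyLabelEncoder().fit(card_type)
encoded = encoder.transform(card_type)           # numerical representation
restored = encoder.reverse_transform(encoded)    # original format recovered
```

The same fit, transform, reverse pattern applies whatever strategy is chosen; only the mapping and the treatment of missing values change.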

The Transformers Glossary contains a full list of available transformers and their settings.

HyperTransformer

The HyperTransformer manages all the transformers you need for an entire, multi-column dataset. You can mix and match your favorite transformers on different columns of your data.

You can also reverse the process to recover the original data format.
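As a rough mental model of managing several columns at once, the sketch below applies a different transformation to each column of a small table and then reverses it. It is plain Python with made-up helpers (`toy_transformers`, `transform_table`) and does not use or describe the real HyperTransformer API.

```python
# Each column maps to a (forward, reverse) pair of functions,
# or to None to leave the column untouched. All names are hypothetical.
toy_transformers = {
    "email_optin": (lambda v: int(v), lambda v: bool(v)),
    "dollars_spent": (lambda v: round(v * 100), lambda v: v / 100),
    "age": None,
}


def transform_table(table, transformers, direction=0):
    """Apply the forward (0) or reverse (1) function to every column."""
    out = {}
    for column, values in table.items():
        pair = transformers.get(column)
        if pair is None:
            out[column] = list(values)  # no transformer: copy as-is
        else:
            out[column] = [pair[direction](v) for v in values]
    return out


table = {
    "email_optin": [True, False],
    "dollars_spent": [12.5, 3.25],
    "age": [31, 47],
}
numeric = transform_table(table, toy_transformers, direction=0)
recovered = transform_table(numeric, toy_transformers, direction=1)
```

Here `recovered` has the same format as `table`, which mirrors the round trip described above.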

Read the HyperTransformer usage guide to learn more.

Sdtypes

The RDT library uses sdtypes to keep track of what each column in your data represents. You can think of an sdtype as representing the semantic (or statistical) meaning of a datatype.

The valid sdtypes in the public RDT library are: 'boolean', 'categorical', 'datetime', 'numerical', 'pii' and 'text'. More are available to licensed, Enterprise users.

An sdtype is a high-level concept that does not depend on how a computer stores the data. A single sdtype (such as 'categorical') can be stored by a computer in several ways (text, integer, etc.).
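The distinction between sdtype and storage can be illustrated with a small sketch. The column names and values below are invented, and the sdtype labels are assigned by hand rather than detected by RDT.

```python
# Hypothetical columns: two are categorical but stored differently,
# and two are stored as integers but mean different things.
columns = {
    "membership": {"sdtype": "categorical", "values": ["gold", "silver", "gold"]},
    "store_id": {"sdtype": "categorical", "values": [101, 205, 101]},
    "age": {"sdtype": "numerical", "values": [31, 47, 52]},
}

# How Python stores each column, independent of its sdtype.
storage = {name: type(col["values"][0]).__name__ for name, col in columns.items()}

for name, col in columns.items():
    print(f"{name}: sdtype={col['sdtype']}, stored as {storage[name]}")
```

`membership` and `store_id` share an sdtype despite different storage, while `store_id` and `age` share storage despite different sdtypes.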

Config

The config describes the plan for transforming all the columns in a dataset. It describes the columns in your dataset, their sdtypes and the transformer that will be applied to each one.

{
  'sdtypes': {
    'last_login': 'datetime',
    'email_optin': 'boolean',
    'credit_card': 'categorical',
    'age': 'numerical',
    'dollars_spent': 'numerical'
  },
  'transformers': {
    'last_login': UnixTimestampEncoder(),
    'email_optin': LabelEncoder(add_noise=True),
    'credit_card': None, # do not do anything with this column
    'age': None, # do not do anything with this column
    'dollars_spent': FloatFormatter(missing_value_replacement="random")
  }
}

In the example above, each column is assigned a transformer based on its sdtype. Columns whose transformer is None are left as-is: their data will not be transformed.
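One way to read such a config is to split the columns into those that will be transformed and those that will be left alone. The sketch below does this on a copy of the example, with the transformer objects replaced by plain strings (an assumption made so the snippet runs without RDT installed).

```python
# Transformer objects are stood in for by their names as strings.
config = {
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical",
    },
    "transformers": {
        "last_login": "UnixTimestampEncoder",
        "email_optin": "LabelEncoder",
        "credit_card": None,  # do not do anything with this column
        "age": None,          # do not do anything with this column
        "dollars_spent": "FloatFormatter",
    },
}

# Columns with a transformer assigned versus columns mapped to None.
transformed = [col for col, t in config["transformers"].items() if t is not None]
untouched = [col for col, t in config["transformers"].items() if t is None]
```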

Some transformers work on a combination of columns. For example, an address may be spread across multiple columns, each corresponding to a different sdtype such as city or postcode. You can supply multiple columns to a transformer by using a tuple of column names as the key.

{
    'sdtypes': {
        'name': 'pii',
        'age': 'numerical',
        'addr_1': 'street_address',
        'addr_2': 'secondary_address',
        'city': 'city',
        'state': 'state_abbr'
    },
    'transformers': {
        'name': AnonymizedFaker(),
        'age': FloatFormatter(missing_value_replacement="random"),
        ('addr_1', 'addr_2', 'city', 'state'): RandomLocationGenerator()
    }
}
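Because a key in 'transformers' can be either a single column name or a tuple of names, code that inspects a config needs to handle both. The helper below, `columns_covered`, is a hypothetical sketch (not part of RDT) that flattens the keys and checks that every column with an sdtype is accounted for; transformer objects are again replaced by strings.

```python
def columns_covered(transformers):
    """Return every column named in the keys, expanding tuple keys."""
    covered = []
    for key in transformers:
        covered.extend(key if isinstance(key, tuple) else (key,))
    return covered


config = {
    "sdtypes": {
        "name": "pii",
        "age": "numerical",
        "addr_1": "street_address",
        "addr_2": "secondary_address",
        "city": "city",
        "state": "state_abbr",
    },
    "transformers": {
        "name": "AnonymizedFaker",
        "age": "FloatFormatter",
        ("addr_1", "addr_2", "city", "state"): "RandomLocationGenerator",
    },
}

covered = columns_covered(config["transformers"])
# Columns that have an sdtype but no entry in 'transformers'.
missing = sorted(set(config["sdtypes"]) - set(covered))
```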
