Basic Concepts

Last updated 7 months ago

The RDT library is a collection of objects that can understand your raw data and convert it into cleaned, numerical data.

Transformers

Transformers are the basic building blocks. They are designed to modify a single column of your dataset. All transformers can also be reversed.

Each transformer is designed for a specific type of data and applies a particular technique. You choose which strategy to use for each column, including how missing values are handled.
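To make the fit, transform and reverse idea concrete, here is a toy sketch in plain Python. `ToyLabelEncoder` is a hypothetical class written only for this illustration; it is not RDT's `LabelEncoder`, and encoding missing values (`None`) as -1 is an assumption made for the example.

```python
class ToyLabelEncoder:
    """Toy reversible transformer for one column (illustration only, not RDT code)."""

    MISSING = -1  # assumed strategy: encode missing values as -1

    def fit(self, column):
        # Learn one integer code per distinct non-missing value,
        # in order of first appearance.
        self._to_code = {}
        for value in column:
            if value is not None and value not in self._to_code:
                self._to_code[value] = len(self._to_code)
        self._to_value = {code: value for value, code in self._to_code.items()}
        return self

    def transform(self, column):
        # Replace each value with its learned integer code.
        return [self.MISSING if v is None else self._to_code[v] for v in column]

    def reverse_transform(self, codes):
        # Map each code back to the original value.
        return [None if c == self.MISSING else self._to_value[c] for c in codes]


plans = ['free', 'pro', None, 'free', 'team']
encoder = ToyLabelEncoder().fit(plans)
encoded = encoder.transform(plans)
recovered = encoder.reverse_transform(encoded)

print(encoded)
print(recovered == plans)
```

The column becomes purely numerical after `transform`, and `reverse_transform` restores the original values, including the missing one.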

HyperTransformer

The HyperTransformer manages all the transformers you need for an entire, multi-column dataset. You can mix and match your favorite transformers on different columns of your data.

You can also reverse the process to recover the original data format.

Sdtypes

The RDT library uses sdtypes to keep track of what each column in your data represents. You can think of an sdtype as representing the semantic (or statistical) meaning of a datatype.

The valid sdtypes in the public RDT library are: 'categorical', 'datetime', 'numerical', 'pii' and 'id'. More are available to licensed Enterprise users.

Versions of RDT before 1.13.0 included an sdtype called 'text'. In newer versions, please use 'id' instead.

An sdtype is a high level concept that does not depend on how a computer stores the data. A single sdtype (such as 'categorical') can be stored by a computer in several ways (text, integer, etc).
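The difference between an sdtype and its storage can be sketched in plain Python. The column names and values below are made up for illustration, and the snippet does not use RDT at all; it only shows that the declared meaning and the storage type are tracked separately.

```python
# Hypothetical columns: name -> (sdtype, stored values).
columns = {
    'membership': ('categorical', ['gold', 'silver', 'gold']),  # text storage
    'store_code': ('categorical', [3, 1, 3]),                   # integer storage
    'age': ('numerical', [25, 31, 47]),                         # integer storage
}

# The storage type is a property of the values; the sdtype is declared separately.
storage = {name: type(values[0]).__name__ for name, (_, values) in columns.items()}

for name, (sdtype, _) in columns.items():
    print(f'{name}: sdtype={sdtype!r}, stored as {storage[name]}')
```

Here `store_code` and `age` share a storage type but have different sdtypes, while `membership` and `store_code` share an sdtype but not a storage type.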

Config

The config describes the plan for transforming all the columns in a dataset. It describes the columns in your dataset, their sdtypes and the transformer that will be applied to each one.

from rdt.transformers import FloatFormatter, LabelEncoder, UnixTimestampEncoder

config = {
  'sdtypes': {
    'last_login': 'datetime',
    'email_optin': 'boolean',
    'credit_card': 'categorical',
    'age': 'numerical',
    'dollars_spent': 'numerical'
  },
  'transformers': {
    'last_login': UnixTimestampEncoder(),
    'email_optin': LabelEncoder(add_noise=True),
    'credit_card': None,  # do not do anything with this column
    'age': None,  # do not do anything with this column
    'dollars_spent': FloatFormatter(missing_value_replacement='random')
  }
}

In the example above, each column is assigned a transformer based on its sdtype. Columns whose transformer is None will not be transformed.

Some transformers work on a combination of columns. For example, an address may be spread across multiple columns, each corresponding to a different sdtype such as city or postcode. You can supply multiple columns to a transformer by using a tuple of column names as the key.

{
  'sdtypes': {
    'name': 'pii',
    'age': 'numerical',
    'addr_1': 'street_address',
    'addr_2': 'secondary_address',
    'city': 'city',
    'state': 'state_abbr'
  },
  'transformers': {
    'name': AnonymizedFaker(),
    'age': FloatFormatter(missing_value_replacement='random'),
    ('addr_1', 'addr_2', 'city', 'state'): RandomLocationGenerator()
  }
}
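The shape of a config, including tuple keys, can be checked with a few lines of plain Python. `check_config` below is a hypothetical helper, not part of RDT, and the rules it enforces (every column has both an sdtype and exactly one transformer entry) are assumptions for this sketch; RDT's own validation may differ. Transformer instances are replaced by placeholder strings so the block runs without RDT installed.

```python
def check_config(config):
    """Return a list of problems found in a config-shaped dict (hypothetical helper)."""
    sdtypes = config['sdtypes']
    problems = []
    assigned = []
    for key in config['transformers']:
        # A tuple key assigns one transformer to several columns at once.
        columns = key if isinstance(key, tuple) else (key,)
        for column in columns:
            if column not in sdtypes:
                problems.append(f'{column!r} has a transformer but no sdtype')
            if column in assigned:
                problems.append(f'{column!r} is assigned to more than one transformer')
            assigned.append(column)
    for column in sdtypes:
        if column not in assigned:
            problems.append(f'{column!r} has an sdtype but no transformer entry')
    return problems


# Same layout as the multi-column example, with placeholder strings
# standing in for transformer instances.
config = {
    'sdtypes': {
        'name': 'pii',
        'age': 'numerical',
        'addr_1': 'street_address',
        'addr_2': 'secondary_address',
        'city': 'city',
        'state': 'state_abbr',
    },
    'transformers': {
        'name': 'AnonymizedFaker',
        'age': 'FloatFormatter',
        ('addr_1', 'addr_2', 'city', 'state'): 'RandomLocationGenerator',
    },
}

problems = check_config(config)
print(problems)
```

An empty list means every column listed under 'sdtypes' is covered by exactly one entry under 'transformers', whether that entry is a single column name or a tuple.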

The Transformers Glossary contains a full list of available transformers and their settings.

Read the HyperTransformer usage guide to learn more.