OrderedUniformEncoder

Compatibility: categorical or boolean data

The OrderedUniformEncoder transforms data that represents ordered categorical values into a uniform distribution in the [0,1] interval. It preserves the frequencies of each category with high accuracy.

from rdt.transformers.categorical import OrderedUniformEncoder

transformer = OrderedUniformEncoder(order=['STRONGLY DISAGREE', 'DISAGREE', 'NEUTRAL',
                                           'AGREE', 'STRONGLY AGREE'])

Parameters

(required) order: A list that specifies the order of the category values

Examples

from rdt.transformers.categorical import OrderedUniformEncoder

transformer = OrderedUniformEncoder(order=['STRONGLY DISAGREE', 'DISAGREE', 'NEUTRAL',
                                           'AGREE', 'STRONGLY AGREE'])

The transformer assigns each category to a unique, non-overlapping subset of the [0,1] interval. The intervals are arranged according to your custom order, and the length of each interval is based on the category's frequency. For example, if the category 'AGREE' occurs with 20% frequency, its subset will have a length of 0.2, such as [0.5, 0.7].
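The interval assignment described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper name `build_intervals` and the sample `responses` list are hypothetical, chosen so that the category frequencies match the example values on this page.

```python
from collections import Counter


def build_intervals(values, order):
    """Map each category to a [start, end] slice of [0, 1], sized by frequency.

    Hypothetical helper for illustration only; it is not part of RDT.
    """
    counts = Counter(values)
    total = len(values)
    intervals = {}
    running = 0  # cumulative count of the categories placed so far
    for category in order:
        start = running / total
        running += counts[category]
        intervals[category] = [start, running / total]
    return intervals


order = ['STRONGLY DISAGREE', 'DISAGREE', 'NEUTRAL', 'AGREE', 'STRONGLY AGREE']

# Made-up sample data: 10 responses with frequencies 0.1, 0.2, 0.2, 0.2, 0.3
responses = (['STRONGLY DISAGREE'] * 1 + ['DISAGREE'] * 2 + ['NEUTRAL'] * 2
             + ['AGREE'] * 2 + ['STRONGLY AGREE'] * 3)

intervals = build_intervals(responses, order)
print(intervals['AGREE'])  # [0.5, 0.7]
```

Because the intervals are laid out in the given order, lower-ranked categories always map to smaller numbers than higher-ranked ones.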

Attributes

After fitting the transformer, you can access the learned values through the attributes.

frequencies: A dictionary that maps each category value to the observed frequency, as a float between 0 and 1

>>> transformer.frequencies
{
  'STRONGLY DISAGREE': 0.1, 
  'DISAGREE': 0.2,
  'NEUTRAL': 0.2,
  'AGREE': 0.2,
  'STRONGLY AGREE': 0.3
}

intervals: A dictionary that maps each category value to its sub-interval of [0,1]. This allows you to determine the exact rules used for transforming and reverse transforming.

>>> transformer.intervals
{
  'STRONGLY DISAGREE': [0, 0.1], 
  'DISAGREE': [0.1, 0.3],
  'NEUTRAL': [0.3, 0.5],
  'AGREE': [0.5, 0.7],
  'STRONGLY AGREE': [0.7, 1.0]
}
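To show how such intervals could act as rules for transforming and reverse transforming, here is a minimal sketch in plain Python. It is an assumption about the general technique, not the library's code: the functions `transform_value` and `reverse_transform_value` are hypothetical, and the handling of values that fall exactly on an interval boundary is a choice made for this sketch.

```python
import random

# The intervals from the example above
intervals = {
    'STRONGLY DISAGREE': [0, 0.1],
    'DISAGREE': [0.1, 0.3],
    'NEUTRAL': [0.3, 0.5],
    'AGREE': [0.5, 0.7],
    'STRONGLY AGREE': [0.7, 1.0],
}


def transform_value(category, rng=random):
    """Replace a category with a number drawn uniformly from its interval."""
    start, end = intervals[category]
    return start + (end - start) * rng.random()


def reverse_transform_value(value):
    """Recover the category whose interval contains the number."""
    for category, (start, end) in intervals.items():
        # Assumption for this sketch: a boundary value belongs to the
        # interval that starts there.
        if start <= value < end:
            return category
    # Values at the very top of the range fall in the last interval
    return list(intervals)[-1]


number = transform_value('AGREE')
print(reverse_transform_value(number))
```

Since each category covers a share of [0,1] equal to its frequency, numbers spread uniformly over [0,1] map back to categories in their original proportions.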

FAQs

When should I use this transformer?

The OrderedUniformEncoder preserves the frequency of each category value with high accuracy. This is especially useful if your data is imbalanced, meaning that some categories occur much more often than others.

What if my categorical column does not have an order?

This transformer is only defined for ordinal categorical data. If there is no order, your data is nominal. Use the UniformEncoder instead.

What happens to missing values?

If there are missing values in your data, they should be defined as part of your order. Use the None keyword to denote a missing value.

In the example below, the missing value is added as the last item.

OrderedUniformEncoder(order=['STRONGLY DISAGREE', 'DISAGREE',
                             'NEUTRAL', 'AGREE', 'STRONGLY AGREE',
                             None])

Add the missing value to whatever ordering position makes sense for your data. If you are unsure, consider adding it to the beginning or the end of the list.
