UniformEncoder

Compatibility: categorical or boolean data

The UniformEncoder transforms data that represents categorical values into a uniform distribution in the [0,1] interval. It is highly accurate at preserving the overall frequencies of each category.

from rdt.transformers.categorical import UniformEncoder

transformer = UniformEncoder()

Parameters

order_by: Apply a prescribed ordering scheme. Use this if the discrete categorical values have an order.

(default) None: Do not apply a particular order.

'numerical_value': If the data is represented by integers or floats, order by those values.

'alphabetical': If the data is represented by strings, order them alphabetically.

Examples

from rdt.transformers.categorical import UniformEncoder

transformer = UniformEncoder(
    order_by='alphabetical'
)

The transformer assigns each category to a unique, non-overlapping subset of the [0,1] interval. The length of each subset equals the category's frequency. For example, if category 'CASH' occurs with 60% frequency, its subset has length 0.6, such as [0.2, 0.8].
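The interval assignment described above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper names compute_intervals and encode are hypothetical, and the order in which categories are laid out (first appearance) is an assumption made for illustration.

```python
import random
from collections import Counter


def compute_intervals(values):
    """Give each category a slice of [0, 1] whose length equals its frequency.

    Hypothetical helper: categories are laid out in order of first appearance.
    """
    counts = Counter(values)
    total = len(values)
    intervals = {}
    start = 0.0
    cumulative = 0
    for category, count in counts.items():
        cumulative += count
        end = cumulative / total  # cumulative frequency marks the upper edge
        intervals[category] = (start, end)
        start = end
    return intervals


def encode(values, intervals, rng):
    """Replace each category with a uniform draw from its own interval."""
    return [rng.uniform(*intervals[value]) for value in values]


payments = ['CREDIT'] * 2 + ['CASH'] * 6 + ['DEBIT'] * 2
intervals = compute_intervals(payments)
encoded = encode(payments, intervals, random.Random(0))
```

Because every category contributes draws in proportion to the length of its slice, the encoded column as a whole is spread uniformly over [0,1].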

Attributes

After fitting the transformer, you can access the learned values through the attributes.

frequencies: A dictionary that maps each category value to the observed frequency, as a float between 0 and 1

>>> transformer.frequencies
{
  'CREDIT': 0.2, 
  'CASH': 0.6,
  'DEBIT': 0.2
}

intervals: A dictionary that maps each category value to its sub-interval of [0,1]. This allows you to determine the exact rules used for transforming and reverse transforming.

>>> transformer.intervals
{
  'CREDIT': [0, 0.2],
  'CASH': [0.2, 0.8],
  'DEBIT': [0.8, 1.0]
}
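To show how such a mapping could drive a reverse transform, here is a minimal sketch that looks up the category for a given number. The decode function is hypothetical, and the boundary handling (lower edge included, upper edge excluded, 1.0 assigned to the last interval) is an assumption, since the source does not say how the library treats boundary values.

```python
# Interval mapping copied from the attribute example above.
intervals = {
    'CREDIT': [0, 0.2],
    'CASH': [0.2, 0.8],
    'DEBIT': [0.8, 1.0],
}


def decode(value, intervals):
    """Return the category whose interval contains value (hypothetical helper)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError('value must lie in [0, 1]')
    last_category = None
    for category, (start, end) in intervals.items():
        # Assumed convention: lower edge inclusive, upper edge exclusive.
        if start <= value < end:
            return category
        last_category = category
    # Only reached when value == 1.0: assign it to the final interval.
    return last_category


decoded = [decode(x, intervals) for x in (0.1, 0.5, 0.95)]
```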

FAQs

When should I use this transformer?

The UniformEncoder is shown to preserve the frequency of each category value with high accuracy. This is especially useful if you have a data imbalance, for example if True occurs only 1% of the time while False occurs 99% of the time.

When should I use the order_by parameter?

Use this parameter when the categorical data is ordinal (has a specific order) and the order can easily be discovered through sorting. For example, you might be storing survey responses as 'response_00', 'response_01', 'response_02', etc.

Don't add this parameter if it isn't necessary. Ordering increases the time it takes for transformation.
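As a quick check of whether an order "can easily be discovered through sorting", the sketch below uses Python's built-in sorted as a stand-in for alphabetical ordering. How the library sorts internally is an assumption here. The labels are illustrative: zero-padded labels sort into their intended order, while unpadded ones do not.

```python
# Zero-padded labels: plain string sorting recovers the intended order.
padded = ['response_02', 'response_10', 'response_00', 'response_01']
sorted_padded = sorted(padded)

# Unpadded labels: 'response_10' sorts before 'response_2', so the
# alphabetical order no longer matches the numeric order.
unpadded = ['response_2', 'response_10', 'response_0', 'response_1']
sorted_unpadded = sorted(unpadded)
```

If your labels behave like the unpadded case, alphabetical ordering will not reflect the true order of the categories.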

What if I'd like to sort the values by a custom order?

In some cases, your categories may not have an alphanumeric ordering scheme. Use the OrderedUniformEncoder to add your own, custom sorting order.

What happens to missing values?

This transformer treats missing values as if they are a new category of data.
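The idea of counting missing values as a category of their own can be sketched in plain Python. This is an illustration only, not the library's code: None stands in for a missing value (in pandas data it would usually be NaN), and the sample data is made up.

```python
from collections import Counter

# None stands in for a missing value in this illustrative column.
payments = ['CASH', None, 'CASH', 'CREDIT', None]

# Count None alongside the real categories so that it receives its own
# frequency, and therefore its own share of the [0, 1] interval.
counts = Counter(payments)
frequencies = {
    category: count / len(payments)
    for category, count in counts.items()
}
```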
