LabelEncoder

Compatibility: categorical data (nominal and ordinal)

The LabelEncoder transforms data that represents categorical values into integers 0, 1, 2, etc. corresponding to each category.

from rdt.transformers.categorical import LabelEncoder
le = LabelEncoder()

Parameters

order_by: Apply a prescribed ordering scheme to the values before assigning the labels

(default) None

Do not apply a particular order. The first unique value will be assigned label 0, the second unique value will be assigned label 1, etc.

'numerical_value'

If the data is represented by integers or floats, order by those values before assigning the labels. That is: label 0 will be assigned to the smallest value, label 1 will be assigned to the second smallest, etc.

'alphabetical'

If the data is represented by strings, order them alphabetically before assigning the labels. That is: label 0 will be assigned to the first alphabetical string, label 1 to the second, etc. Note: Digits will also be alphabetized in order from '0' to '9'.
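The three ordering schemes above can be sketched in plain Python. This is only an illustration of the described behavior, not the RDT implementation; the helper name `assign_labels` and the sample values are hypothetical.

```python
def assign_labels(values, order_by=None):
    """Map each unique value to an integer label (illustrative helper only)."""
    # dict.fromkeys drops duplicates while keeping first-appearance order
    unique = list(dict.fromkeys(values))
    if order_by in ('numerical_value', 'alphabetical'):
        # Sort before numbering, so label 0 goes to the smallest/first value
        unique = sorted(unique)
    elif order_by is not None:
        raise ValueError(f"Unknown order_by: {order_by!r}")
    return {value: label for label, value in enumerate(unique)}


# Default (None): labels follow the order in which values first appear
print(assign_labels(['VISA', 'AMEX', 'VISA', 'DISCOVER']))
# {'VISA': 0, 'AMEX': 1, 'DISCOVER': 2}

# 'numerical_value': the smallest number gets label 0
print(assign_labels([30, 10, 20, 10], order_by='numerical_value'))
# {10: 0, 20: 1, 30: 2}

# 'alphabetical': digits are compared as characters, so 'r_10' sorts before 'r_2'
print(assign_labels(['r_10', 'r_2', 'r_1'], order_by='alphabetical'))
# {'r_1': 0, 'r_10': 1, 'r_2': 2}
```

The last call shows why the note about digits matters: without zero-padding, string sorting does not match numeric order.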

add_noise: Add noise to the label values

(default) False

Do not add noise. Each time a category appears, it will always be transformed to the same label value.

True

Add noise. A category will be transformed to its label with some noise added. For example, instead of the label 1, values might be noised to 1.001, 1.456, 1.999, etc.
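A minimal sketch of what noised labels might look like, assuming the noise is drawn uniformly from [0, 1) and added to the integer label. The uniform distribution and the helper name `add_label_noise` are assumptions made for illustration; this page does not state how the transformer generates noise.

```python
import random


def add_label_noise(label, rng):
    """Return a value in [label, label + 1), up to floating-point rounding."""
    # rng.random() returns a float in [0.0, 1.0)
    return label + rng.random()


rng = random.Random(0)  # seeded so the sketch is repeatable
noised = [add_label_noise(1, rng) for _ in range(5)]

# Every noised value stays between the label and the next label
print(all(1 <= value < 2 for value in noised))  # True
```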

Examples

from rdt.transformers.categorical import LabelEncoder

# order the values alphabetically before assigning the labels
# and then add noise to the labels
le = LabelEncoder(order_by='alphabetical', add_noise=True)

FAQs

When should I use the order_by parameter?

Use this parameter when the categorical data is ordinal (has a specific order) and the order can easily be discovered through sorting. For example, you might be storing survey responses as 'response_00', 'response_01', 'response_02', etc.

Don't add this parameter if it isn't necessary. Ordering increases the time it takes for transformation.

What if I'd like to sort the values by a custom order?

In some cases, your categories may not have an alphanumeric ordering scheme. Use the OrderedLabelEncoder to add your own, custom sorting order.
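To illustrate why a custom order can be needed, the plain-Python sketch below compares alphabetical labels with labels assigned from a hand-written order. It does not call OrderedLabelEncoder; the category names and variable names are hypothetical.

```python
# Hypothetical ordinal categories whose natural order is not alphabetical
custom_order = ['low', 'medium', 'high']

# Labels assigned from the custom order
custom_labels = {category: label for label, category in enumerate(custom_order)}
print(custom_labels)
# {'low': 0, 'medium': 1, 'high': 2}

# Alphabetical sorting would put 'high' first, losing the intended order
alphabetical_labels = {
    category: label for label, category in enumerate(sorted(custom_order))
}
print(alphabetical_labels)
# {'high': 0, 'low': 1, 'medium': 2}
```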

What happens to missing values?

This transformer treats missing values as if they are a new category of data. If you are using the order_by parameter, the missing values will always be assigned the highest label value.
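The ordered case described above can be sketched as follows: missing values form their own category and receive the label after all sorted categories. The helper names `is_missing` and `encode_sorted_with_missing` are hypothetical, and this is not the RDT implementation.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_sorted_with_missing(values):
    """Label sorted categories from 0; give missing values the highest label."""
    present = sorted({v for v in values if not is_missing(v)})
    labels = {value: label for label, value in enumerate(present)}
    missing_label = len(present)  # one past the last real category
    return [missing_label if is_missing(v) else labels[v] for v in values]


print(encode_sorted_with_missing(['b', None, 'a', float('nan'), 'b']))
# [1, 2, 0, 2, 1]
```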

When should I add noise?

If you do not add noise, the transformer will convert each category to a single, fixed label. For example, AMEX is always converted to the label 1. If you add noise, the transformer will generate some random variation, so the same category maps to many different values. For example, AMEX may sometimes be 1.001 and other times be 1.999, but always in the interval [1, 2).

Adding noise creates a continuous distribution. Whether to add noise depends on how you plan to use the data. If you are using the data for machine learning (ML), consider whether the techniques you plan to use work better on continuous distributions.
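One consequence of keeping noised values in the interval [1, 2) is that the original label can still be read off by rounding down. The sketch below assumes uniform noise and floor-based recovery; both are assumptions for illustration, not a description of how the transformer reverses the data.

```python
import math
import random

rng = random.Random(42)  # seeded so the sketch is repeatable
amex_label = 1           # label from the AMEX example above

# Four noised versions of the same label, all different
noised = [amex_label + rng.random() for _ in range(4)]

# Rounding down maps every noised value back to the same label
recovered = [math.floor(value) for value in noised]
print(recovered)  # [1, 1, 1, 1]
```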
