
LabelEncoder

Compatibility: categorical data (nominal and ordinal)
The LabelEncoder transforms data that represents categorical values into integers 0, 1, 2, etc. corresponding to each category.
from rdt.transformers.categorical import LabelEncoder
le = LabelEncoder()

Parameters

order_by: Apply a prescribed ordering scheme to the values before assigning the labels
(default) None
Do not apply a particular order. The first unique value will be assigned label 0, the second unique value will be assigned label 1, etc.
'numerical_value'
If the data is represented by integers or floats, order by those values before assigning the labels. That is: label 0 will be assigned to the smallest value, label 1 will be assigned to the second smallest, etc.
'alphabetical'
If the data is represented by strings, order them alphabetically before assigning the labels. That is: label 0 will be assigned to the first alphabetical string, label 1 to the second, etc. Note: Digits will also be alphabetized in order from '0' to '9'.
add_noise: Add noise to the label values
(default) False
Do not add noise. Each time a category appears, it will always be transformed to the same label value.
True
Add noise. A category will be transformed to the same label with some noise added. For example, instead of the label 1, values might be noised to 1.001, 1.456, 1.999, etc.
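The three order_by settings can be illustrated with a minimal plain-Python sketch. This is not RDT's implementation: assign_labels is a hypothetical helper written only to show how each ordering scheme decides which category receives label 0, 1, 2, and so on.

```python
def assign_labels(values, order_by=None):
    """Hypothetical helper (not part of RDT): map each category to an integer label."""
    # dict.fromkeys drops duplicates while keeping first-appearance order
    categories = list(dict.fromkeys(values))
    if order_by in ('numerical_value', 'alphabetical'):
        # Sort numbers by value, or strings alphabetically, before labeling
        categories = sorted(categories)
    elif order_by is not None:
        raise ValueError(f'Unknown order_by: {order_by!r}')
    return {category: label for label, category in enumerate(categories)}


default_labels = assign_labels(['b', 'a', 'c', 'a'])
numeric_labels = assign_labels([30, 10, 20, 10], order_by='numerical_value')
alpha_labels = assign_labels(
    ['response_02', 'response_00', 'response_01'], order_by='alphabetical'
)

print(default_labels)  # {'b': 0, 'a': 1, 'c': 2}
print(numeric_labels)  # {10: 0, 20: 1, 30: 2}
print(alpha_labels)    # {'response_00': 0, 'response_01': 1, 'response_02': 2}
```

With the default setting, 'b' receives label 0 only because it appears first; the two ordered settings make the labels independent of row order.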

Examples

from rdt.transformers.categorical import LabelEncoder
# order the values alphabetically before assigning the labels
# and then add noise to the labels
le = LabelEncoder(order_by='alphabetical', add_noise=True)

FAQs

Use the order_by parameter when the categorical data is ordinal (has a specific order) and the order can easily be discovered through sorting. For example, you might be storing survey responses as 'response_00', 'response_01', 'response_02', etc.
Don't set order_by if it isn't necessary. Ordering increases the time it takes for transformation.
In some cases, your categories may not have an alphanumeric ordering scheme. Use the CustomLabelEncoder to add your own, custom sorting order.
This transformer treats missing values as if they are a new category of data. If you are using the order_by parameter, the missing values will always be assigned the highest label value.
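The missing-value behavior described above can be sketched in plain Python. This is an illustration under assumptions, not RDT's code: is_missing and assign_labels_with_missing are hypothetical helpers, and treating both None and NaN as "missing" is a choice made for this sketch.

```python
import math


def is_missing(value):
    """Hypothetical check: treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def assign_labels_with_missing(values):
    """Hypothetical helper: sort present categories, give missing the highest label."""
    present = sorted({v for v in values if not is_missing(v)})
    labels = {category: label for label, category in enumerate(present)}
    # All missing values share one extra category placed after the sorted ones
    has_missing = any(is_missing(v) for v in values)
    missing_label = len(present) if has_missing else None
    return labels, missing_label


data = ['b', None, 'a', float('nan'), 'c']
labels, missing_label = assign_labels_with_missing(data)
encoded = [missing_label if is_missing(v) else labels[v] for v in data]

print(labels)         # {'a': 0, 'b': 1, 'c': 2}
print(missing_label)  # 3
print(encoded)        # [1, 3, 0, 3, 2]
```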
If you do not add noise, the transformer converts each category to a single, fixed label. For example, AMEX is always converted to the label 1. If you add noise, the transformer adds random variation, so the same category maps to a range of values rather than one number. For example, AMEX may sometimes be 1.001 and other times 1.999, but always in the interval [1, 2).
Adding noise creates a continuous distribution. Whether to add noise depends on how you plan to use the data. If you are using the data for machine learning (ML), consider whether the techniques you plan to use work better on continuous distributions.
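As a rough illustration of the interval behavior, the sketch below adds uniform noise to a label and recovers the label by rounding down. It is not RDT's implementation: add_noise and remove_noise are hypothetical helpers, the uniform distribution is an assumption, and recovering the label with a floor is one plausible way to reverse the noise (requires Python 3.9+ for math.nextafter).

```python
import math
import random


def add_noise(label, rng):
    """Hypothetical helper: spread an integer label over [label, label + 1)."""
    # rng.random() is in [0.0, 1.0); the min() guards against the rare case
    # where floating-point rounding would push the sum up to label + 1
    upper = math.nextafter(label + 1, label)
    return min(label + rng.random(), upper)


def remove_noise(noised_value):
    """Hypothetical helper: recover the original label by rounding down."""
    return math.floor(noised_value)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
amex_label = 1
noised = [add_noise(amex_label, rng) for _ in range(5)]
recovered = [remove_noise(value) for value in noised]

print(all(1 <= value < 2 for value in noised))  # True
print(recovered)                                # [1, 1, 1, 1, 1]
```

Because every noised value stays inside its own unit interval, the categories remain separable even though the column now looks continuous.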