LabelEncoder
Compatibility: categorical data (nominal and ordinal)
The LabelEncoder transforms data that represents categorical values into integers 0, 1, 2, etc. corresponding to each category.
from rdt.transformers.categorical import LabelEncoder
le = LabelEncoder()
order_by: Apply a prescribed ordering scheme to the values before assigning the labels.
(default) None | Do not apply a particular order. The first unique value will be assigned label 0, the second unique value will be assigned label 1, etc. |
'numerical_value' | If the data is represented by integers or floats, order by those values before assigning the labels. That is: label 0 will be assigned to the smallest value, label 1 will be assigned to the second smallest, etc. |
'alphabetical' | If the data is represented by strings, order them alphabetically before assigning the labels. That is: label 0 will be assigned to the first alphabetical string, label 1 to the second, etc. Note: Digits will also be alphabetized in order from '0' to '9'. |
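The three ordering schemes above can be illustrated with a small plain-Python sketch. It is not RDT's implementation: `assign_labels` is a hypothetical helper written only to show how the resulting labels differ, and it assumes each column holds values of a single sortable type.

```python
def assign_labels(values, order_by=None):
    """Hypothetical helper (not part of RDT): map each unique value to an integer label."""
    # dict.fromkeys drops duplicates while keeping first-appearance order
    unique = list(dict.fromkeys(values))
    if order_by in ('numerical_value', 'alphabetical'):
        # Assumption: a plain sort stands in for both ordering schemes
        unique = sorted(unique)
    elif order_by is not None:
        raise ValueError(f'Unknown order_by value: {order_by!r}')
    return {value: label for label, value in enumerate(unique)}


cards = ['VISA', 'AMEX', 'VISA', 'DISCOVER']

# order_by=None: labels follow the order in which values first appear
first_seen = assign_labels(cards)

# order_by='alphabetical': labels follow the sorted strings
alphabetical = assign_labels(cards, order_by='alphabetical')

# order_by='numerical_value': the smallest number gets label 0
numerical = assign_labels([30, 10, 20, 10], order_by='numerical_value')

print(first_seen)    # {'VISA': 0, 'AMEX': 1, 'DISCOVER': 2}
print(alphabetical)  # {'AMEX': 0, 'DISCOVER': 1, 'VISA': 2}
print(numerical)     # {10: 0, 20: 1, 30: 2}
```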
add_noise: Add noise to the label values.
(default) False | Do not add noise. Each time a category appears, it will always be transformed to the same label value. |
True | Add noise. A category will be transformed to the same label with some noise added. For example, instead of the label 1, values might be noised to 1.001, 1.456, 1.999, etc. |
from rdt.transformers.categorical import LabelEncoder
# order the values alphabetically before assigning the labels
# and then add noise to the labels
le = LabelEncoder(order_by='alphabetical', add_noise=True)
Use the order_by parameter when the categorical data is ordinal (has a specific order) and the order can easily be discovered through sorting. For example, you might be storing survey responses as 'response_00', 'response_01', 'response_02', etc.
Don't add this parameter if it isn't necessary. Ordering increases the time it takes for transformation.
In some cases, your categories may not have an alphanumeric ordering scheme. Use the OrderedLabelEncoder to add your own, custom sorting order.
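The survey-response example works because the numeric suffixes are zero-padded. A quick check with Python's built-in `sorted` (used here as a stand-in for alphabetical ordering, not as RDT's own code) shows what happens when they are not; the sample values are made up for illustration.

```python
# Zero-padded suffixes: alphabetical order matches the intended order
padded = ['response_10', 'response_02', 'response_00', 'response_01']
sorted_padded = sorted(padded)

# Unpadded suffixes: strings are compared character by character,
# so 'response_10' sorts before 'response_2'
unpadded = ['response_10', 'response_2', 'response_0', 'response_1']
sorted_unpadded = sorted(unpadded)

print(sorted_padded)    # ['response_00', 'response_01', 'response_02', 'response_10']
print(sorted_unpadded)  # ['response_0', 'response_1', 'response_10', 'response_2']
```

If your categories look like the second list, alphabetical ordering will not recover the intended order, and a custom order is the safer choice.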
If you do not add noise, the transformer will convert each category to a distinct label. For example, AMEX is always converted to the label 1. If you add noise, the transformer will generate some random variation so the numbers are not distinct. For example, AMEX may sometimes be 1.001 and other times be 1.999, but always in the interval [1, 2). Adding noise creates a continuous distribution. Whether to add noise depends on how you plan to use the data. If you are using the data for machine learning (ML), consider whether the techniques you plan to use work better on continuous distributions.
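As a rough illustration of the interval behavior described above, the sketch below adds a uniform random offset to an integer label. The uniform draw is an assumption (this page does not say how the noise is generated), and `add_label_noise` and `remove_label_noise` are hypothetical helpers, not RDT functions.

```python
import random


def add_label_noise(label, rng):
    """Hypothetical helper: shift an integer label by a random offset below 1."""
    # Assumption: uniform noise; rng.random() returns a float in [0.0, 1.0)
    return label + rng.random()


def remove_label_noise(noised_value):
    """Hypothetical helper: recover the label by dropping the fractional part."""
    # Labels are non-negative, so truncation acts as a floor
    return int(noised_value)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
noised = [add_label_noise(1, rng) for _ in range(5)]
recovered = [remove_label_noise(value) for value in noised]

print(noised)     # five floats between 1 and 2
print(recovered)  # [1, 1, 1, 1, 1]
```

Because every noised value stays within its label's unit interval, the original label can still be recovered from the integer part.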