BinaryEncoder

Compatibility: boolean data

The BinaryEncoder transforms True and False values into the numerical values 1 and 0, respectively.

from rdt.transformers.boolean import BinaryEncoder
transformer = BinaryEncoder()
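
To make the mapping concrete, here is a minimal pure-Python sketch of the encoding idea. It is an illustration only, not RDT's implementation: the helper names encode_booleans and decode_booleans are hypothetical, and the 0.5 threshold used on the way back is an assumption made for this example.

```python
def encode_booleans(values):
    """Map True -> 1.0 and False -> 0.0, leaving missing values (None) as-is."""
    return [None if v is None else float(v) for v in values]


def decode_booleans(numbers):
    """Map numbers back to booleans; values >= 0.5 become True (assumed threshold)."""
    return [None if n is None else n >= 0.5 for n in numbers]


original = [True, False, None, True]
encoded = encode_booleans(original)
decoded = decode_booleans(encoded)

print(encoded)  # [1.0, 0.0, None, 1.0]
print(decoded)  # [True, False, None, True]
```

The missing value passes through untouched here; the parameters below control what happens to it instead.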

Parameters

missing_value_replacement: Add this argument to replace missing values during the transform phase

(default) 'mean'

Replace all missing values with the average value.

'mode'

Replace all missing values with the most frequently occurring value.

<number>

Replace all missing values with the specified number (0, -1, 0.5, etc.)

None

Do not replace missing values. The transformed data will continue to have missing values.
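
The four options above can be sketched in plain Python as follows. This is a hedged illustration of what each setting means for already-encoded 0/1 values, not the library's code; the function name replace_missing is hypothetical and missing values are represented as None for simplicity.

```python
from statistics import mean, mode


def replace_missing(encoded, replacement):
    """Fill None entries in a list of 0/1 values according to `replacement`."""
    if replacement is None:
        return list(encoded)  # leave missing values untouched
    observed = [v for v in encoded if v is not None]
    if replacement == 'mean':
        fill = mean(observed)   # average of the observed values
    elif replacement == 'mode':
        fill = mode(observed)   # most frequently occurring value
    else:
        fill = replacement      # a user-supplied number
    return [fill if v is None else v for v in encoded]


encoded = [1.0, 1.0, 0.0, None]

print(replace_missing(encoded, 'mean'))  # last entry becomes the mean of 1, 1, 0
print(replace_missing(encoded, 'mode'))  # [1.0, 1.0, 0.0, 1.0]
print(replace_missing(encoded, 0))       # [1.0, 1.0, 0.0, 0]
print(replace_missing(encoded, None))    # [1.0, 1.0, 0.0, None]
```

Note that 'mean' can produce a value that is neither 0 nor 1, whereas 'mode' always yields one of the observed values.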

(deprecated) model_missing_values: Use the missing_value_generation parameter instead.

missing_value_generation: Add this argument to determine how to recreate missing values during the reverse transform phase

(default) 'random'

Randomly assign missing values in roughly the same proportion as the original data.

'from_column'

Create a new column to store whether the value should be missing. Use it to recreate missing values. Note: Adding extra columns uses more memory and increases the RDT processing time.

None

Do not recreate missing values.
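
The difference between the three options can be sketched as below. This is a simplified illustration under stated assumptions, not RDT's implementation: recreate_missing, null_fraction, is_null, and seed are hypothetical names, and missing values are represented as None.

```python
import random


def recreate_missing(booleans, generation, null_fraction=0.0, is_null=None, seed=0):
    """Reinsert None into reverse-transformed booleans.

    generation: 'random', 'from_column', or None (mirrors the option names above).
    null_fraction: share of missing values seen in the original data ('random').
    is_null: per-row flags saved during transform ('from_column').
    """
    if generation is None:
        return list(booleans)  # no missing values are recreated
    if generation == 'from_column':
        # Use the stored flag column to decide exactly which rows are missing.
        return [None if flag else v for v, flag in zip(booleans, is_null)]
    if generation == 'random':
        # Blank out rows at random, at roughly the original proportion.
        rng = random.Random(seed)
        return [None if rng.random() < null_fraction else v for v in booleans]
    raise ValueError(f'Unknown option: {generation!r}')


values = [True, False, True, True]
flags = [False, True, False, False]  # row 1 was missing in the original data

print(recreate_missing(values, 'from_column', is_null=flags))  # [True, None, True, True]
print(recreate_missing(values, None))                          # [True, False, True, True]
print(recreate_missing(values, 'random', null_fraction=0.25))  # positions depend on the seed
```

The 'from_column' branch needs one extra flag per row, which is the memory and processing cost mentioned in the note above; 'random' only needs a single proportion.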

Examples

from rdt.transformers.boolean import BinaryEncoder
transformer = BinaryEncoder(missing_value_replacement='mode',
                            missing_value_generation='from_column')

FAQs

Should I replace missing values?

The decision to replace missing values is based on how you plan to use your data. For example, you might be using RDT to clean your data for machine learning (ML). Check to see whether the ML techniques you plan to use allow missing values.

What methods are the best for replacing missing values?

The best method for replacing missing values depends on what they mean in your dataset. For example, if a missing value is equivalent to False, replace it with 0.

When is it necessary to model missing values?

When setting the missing_value_generation parameter, consider whether the "missingness" of the data is important. For example, maybe the user opted out of supplying the info on purpose, or maybe a missing value is highly correlated with another column in your dataset. If "missingness" is something you want to account for, you should model missing values (for example, by using 'from_column').
