FloatFormatter

Compatibility: numerical data
The FloatFormatter transforms numerical data. By default, it does nothing because numerical data is already suitable for data science. Optionally, it can handle missing values and learn rounding schemes and min/max bounds.
from rdt.transformers.numerical import FloatFormatter
ff = FloatFormatter()

Parameters

computer_representation: Add this argument when the original data has a specific representation, even if it's not loaded that way into Python. The transformer will make sure that any reverse transformed data is compatible with this representation.
(default) 'Float'
The data is a float
'Int8', 'Int16', 'Int32', 'Int64'
The data is a signed integer represented as an 8, 16, 32 or 64-bit number
'UInt8', 'UInt16', 'UInt32', 'UInt64'
The data is an unsigned integer represented as an 8, 16, 32 or 64-bit number
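To make the representation names concrete, the sketch below computes the value range that each name implies and shows one way out-of-range values could be made compatible with it. This is plain Python, not RDT code: the helper names `integer_bounds` and `fit_to_representation` are hypothetical, and rounding then clipping is an assumption about how compatibility might be enforced.

```python
def integer_bounds(representation):
    """Return (min, max) for a name such as 'Int8' or 'UInt16'."""
    if representation.startswith('UInt'):
        bits = int(representation[len('UInt'):])
        return 0, 2 ** bits - 1
    bits = int(representation[len('Int'):])
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fit_to_representation(values, representation):
    """Round to whole numbers, then clip into the representable range.

    Assumption: this mimics the idea of keeping reverse transformed data
    compatible with the representation; RDT's own logic may differ.
    """
    low, high = integer_bounds(representation)
    return [min(max(round(v), low), high) for v in values]


print(integer_bounds('Int8'))    # (-128, 127)
print(integer_bounds('UInt8'))   # (0, 255)
print(fit_to_representation([300.2, -5.7, 12.4], 'UInt8'))  # [255, 0, 12]
```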
missing_value_replacement: Add this argument to replace missing values during the transform phase
(default) 'mean'
Replace all missing values with the average value.
'mode'
Replace all missing values with the most frequently occurring value
<number>
Replace all missing values with the specified number (0, -1, 0.5, etc.)
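The three replacement strategies can be illustrated with a small standalone sketch. It uses only the Python standard library and is not how RDT implements the feature; `replace_missing` and `is_missing` are hypothetical helpers, and treating both `None` and NaN as missing is an assumption.

```python
import math
from statistics import mean, mode


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def replace_missing(values, replacement='mean'):
    """Fill missing entries with the mean, the mode, or a fixed number."""
    present = [v for v in values if not is_missing(v)]
    if replacement == 'mean':
        fill = mean(present)
    elif replacement == 'mode':
        fill = mode(present)
    else:
        fill = replacement  # a specific number such as 0, -1 or 0.5
    return [fill if is_missing(v) else v for v in values]


print(replace_missing([1.0, None, 3.0]))                       # [1.0, 2.0, 3.0]
print(replace_missing([1.0, 1.0, float('nan'), 4.0], 'mode'))  # [1.0, 1.0, 1.0, 4.0]
print(replace_missing([1.0, None], -1))                        # [1.0, -1]
```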
model_missing_values: Add this argument to create another column describing whether the values are missing
(default) False
Do not create a new column. During the reverse transform, missing values are added in again randomly.
True
Create a new column (if there are missing values). This allows you to keep track of the missing values so you can recreate them on the reverse transform.
Setting this value to True may add another column to your dataset. Adding extra columns uses more memory and increases the RDT processing time.
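The difference between the two settings can be sketched in plain Python. With an indicator column, the reverse step can restore missing values in their original positions. Without it, missing values can only be reinserted at random. The function names are hypothetical, and reinserting at the original missing rate is an assumption, not a statement about RDT's internals.

```python
import math
import random


def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def transform_with_indicator(values, fill):
    """Forward step: fill missing values and record where they were."""
    filled = [fill if is_missing(v) else v for v in values]
    indicator = [1.0 if is_missing(v) else 0.0 for v in values]
    return filled, indicator


def reverse_with_indicator(filled, indicator):
    """Reverse step: put missing values back where the indicator says."""
    return [None if flag >= 0.5 else v for v, flag in zip(filled, indicator)]


def reverse_without_indicator(filled, missing_rate, seed=0):
    """Reverse step without an indicator: reinsert missing values at random.

    Assumption: each value is dropped with probability `missing_rate`.
    """
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else v for v in filled]


filled, indicator = transform_with_indicator([1.0, None, 3.0], fill=2.0)
print(filled, indicator)                          # [1.0, 2.0, 3.0] [0.0, 1.0, 0.0]
print(reverse_with_indicator(filled, indicator))  # [1.0, None, 3.0]
```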
enforce_min_max_values: Add this argument to allow the transformer to learn the min and max allowed values from the data.
(default) False
Do not learn any min or max values from the dataset. When reverse transforming the data, the values may be above or below what was originally present.
True
Learn the min and max values from the input data. When reverse transforming the data, any out-of-bounds values will be clipped to the min or max value.
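As a rough illustration of what learning and enforcing bounds means, the sketch below records the min and max at fit time and clips out-of-bounds values on the way back. `MinMaxClipper` is a hypothetical class written for this example and is not part of RDT.

```python
class MinMaxClipper:
    """Learn bounds from the input data, then clip values to those bounds."""

    def fit(self, values):
        self.low = min(values)
        self.high = max(values)
        return self

    def reverse(self, values):
        # Any value outside the learned range is moved to the nearest bound.
        return [min(max(v, self.low), self.high) for v in values]


clipper = MinMaxClipper().fit([10.0, 25.5, 40.0])
print(clipper.reverse([5.0, 30.0, 99.9]))  # [10.0, 30.0, 40.0]
```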
learn_rounding_scheme: Add this argument to allow the transformer to learn about rounded values in your dataset.
(default) False
Do not learn or enforce any rounding scheme. When reverse transforming the data, there may be many decimal places present.
True
Learn the rounding rules from the input data. When reverse transforming the data, round the number of digits to match the original.
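One way to picture a rounding scheme is to find the largest number of decimal digits used in the input and round later values to match. The sketch below does this with the standard `decimal` module. It assumes finite float inputs, the helper names are hypothetical, and RDT may detect the number of digits differently.

```python
from decimal import Decimal


def learn_decimal_digits(values):
    """Return the largest number of decimal digits seen in the values."""
    digits = 0
    for value in values:
        # normalize() drops trailing zeros, so 10.0 counts as 0 digits.
        exponent = Decimal(str(value)).normalize().as_tuple().exponent
        digits = max(digits, -exponent)
    return digits


def apply_rounding(values, digits):
    """Round each value to the learned number of decimal digits."""
    return [round(v, digits) for v in values]


digits = learn_decimal_digits([1.25, 3.5, 10.0])
print(digits)                                  # 2
print(apply_rounding([3.14159, 2.0], digits))  # [3.14, 2.0]
```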

Examples

from rdt.transformers.numerical import FloatFormatter
ff = FloatFormatter(missing_value_replacement='mean',
                    learn_rounding_scheme=True,
                    model_missing_values=True)
On the forward transform, this transformer uses missing_value_replacement and model_missing_values. In this case, missing values are replaced with the mean and an extra column records whether each value was missing.
On the reverse transform, enforce_min_max_values and learn_rounding_scheme are applied. In this case, only learn_rounding_scheme is set, so the values are rounded to 2 decimal digits like the original data but are not clipped to the original min and max. Also, missing values are added back in.

FAQs

Using enforce_min_max_values and learn_rounding_scheme will enforce the min/max values or rounding scheme when reverse transforming your data. Use these parameters if you want to recover data in the same format as the original.
The decision to replace missing values is based on how you plan to use your data. For example, you might be using RDT to clean your data for machine learning (ML). Check to see whether the ML techniques you plan to use allow missing values.
The method for replacing missing values is dependent on what they mean in your dataset. For example:
  • If missing values are the equivalent of 0, replace them with a 0.
  • If missing values indicate that you don't know the value at all, you might replace them with the 'mean' or the 'mode'.
When setting the model_missing_values parameter, consider whether the "missingness" of the data is something important. For example, maybe the user opted out of supplying the info on purpose, or maybe a missing value is highly correlated with another column in your dataset. If "missingness" is something you want to account for, you should model missing values.