ClusterBasedNormalizer
Compatibility: numerical data
The ClusterBasedNormalizer performs a statistical transformation on numerical data. It approximates the overall column as a mixture of different shapes (components). Then, it normalizes the values and clusters them into the closest component.
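The idea of assigning a value to its closest component and normalizing it can be sketched in a few lines. This is only an illustration under stated assumptions, not the transformer's actual implementation: the components are hard-coded as hypothetical (mean, standard deviation) pairs instead of being learned from data, "closest" is taken to mean the smallest distance measured in standard deviations, and the helper names are invented for this example.

```python
def normalize(value, components):
    """Assign `value` to its closest component and normalize it.

    `components` is a list of (mean, std) pairs. Here they are
    hypothetical, hand-picked values; a real transformer would
    estimate them from the column.
    """
    # Closest = smallest distance measured in standard deviations (assumption).
    best = min(
        range(len(components)),
        key=lambda i: abs(value - components[i][0]) / components[i][1],
    )
    mean, std = components[best]
    return (value - mean) / std, best


def denormalize(normalized, component, components):
    """Invert `normalize` using the stored component index."""
    mean, std = components[component]
    return normalized * std + mean


# Two illustrative components: one around 10, one around 100.
components = [(10.0, 2.0), (100.0, 5.0)]
normalized, component = normalize(95.0, components)
restored = denormalize(normalized, component, components)
```

The pair (normalized value, component index) is what makes the transformation reversible in this sketch: the index records which component's mean and spread to use when converting back.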
Parameters
missing_value_replacement: Add this argument to replace missing values during the transform phase.
(default) | Replace missing values with a random value. The value is chosen uniformly at random from the min/max range. |
| Replace all missing values with the average value. |
| Replace all missing values with the most frequently occurring value. |
| Replace all missing values with the specified number ( |
| Deprecated. Do not replace missing values. The transformed data will continue to have missing values. |
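The replacement strategies described above can be illustrated with a small standard-library sketch. The strategy labels used here ("random", "mean", "mode") and the function name are hypothetical names chosen for this example, not confirmed option values of the transformer, and the code is not the library's implementation.

```python
import random
import statistics


def replace_missing(values, strategy, rng=None):
    """Return a copy of `values` with None entries filled in.

    `strategy` is one of the illustrative labels "random", "mean",
    "mode", or a number to use as a constant fill value.
    """
    present = [v for v in values if v is not None]
    if strategy == "random":
        # Draw uniformly from the observed min/max range.
        rng = rng or random.Random()
        low, high = min(present), max(present)
        return [rng.uniform(low, high) if v is None else v for v in values]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        fill = strategy  # a specific number supplied by the caller
    return [fill if v is None else v for v in values]


values = [1.0, None, 3.0, 3.0, None]
by_mean = replace_missing(values, "mean")
by_mode = replace_missing(values, "mode")
by_constant = replace_missing(values, 0)
by_random = replace_missing(values, "random", random.Random(0))
```

Passing a seeded `random.Random` instance keeps the random strategy reproducible in this sketch.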
(deprecated) model_missing_values: Use the missing_value_generation parameter instead.
missing_value_generation: Add this argument to determine how to recreate missing values during the reverse transform phase.
(default) | Randomly assign missing values in roughly the same proportion as the original data. |
| Create a new column to store whether the value should be missing. Use it to recreate missing values. Note: Adding extra columns uses more memory and increases the RDT processing time. |
| Do not recreate missing values. |
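The two ways of recreating missing values can be contrasted with a short sketch. It assumes the missing-value proportion and the indicator column are already available; the function names are hypothetical and the code only illustrates the idea, not the library's implementation.

```python
import random


def recreate_randomly(values, missing_fraction, rng):
    """Blank out each value with probability `missing_fraction`.

    This reproduces roughly the same proportion of missing values
    as the original data, but not their original positions.
    """
    return [None if rng.random() < missing_fraction else v for v in values]


def recreate_from_indicator(values, is_missing):
    """Blank out exactly the values flagged in the extra indicator column."""
    return [None if flag else v for v, flag in zip(values, is_missing)]


values = [4.2, 5.1, 6.3, 7.0]
from_column = recreate_from_indicator(values, [False, True, False, False])
random_result = recreate_randomly(values, 0.25, random.Random(0))
```

The indicator approach needs one extra stored flag per row, which is the memory and processing cost the note above refers to.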
max_clusters: The maximum number of components to create when estimating the shape of the overall column.
(default) 10 | Cap the number of clusters to 10 |
| Cap the number of clusters to the specified value (e.g.
weight_threshold: The minimum weight that is needed to possibly form a new component. Note that the total number of components is still capped by the max_clusters argument above.
(default) | Create a new component when the weight is |
| Create a new component when the weight is at or above |
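How the two settings interact can be sketched as follows. This assumes a mixture has already been fit and produced one weight per candidate component; the weights, the threshold of 0.01, and the function name are illustrative values invented for this example rather than the transformer's defaults or API.

```python
def select_components(weights, weight_threshold, max_clusters=10):
    """Return the indices of components that are kept.

    At most `max_clusters` candidates are considered, and a candidate
    is kept only if its weight is at or above `weight_threshold`.
    """
    candidates = weights[:max_clusters]
    return [i for i, w in enumerate(candidates) if w >= weight_threshold]


# Hypothetical fitted weights for four candidate components.
weights = [0.6, 0.3, 0.096, 0.004]
kept = select_components(weights, weight_threshold=0.01)
kept_capped = select_components(weights, weight_threshold=0.01, max_clusters=2)
```

In this sketch the threshold can only reduce the number of components; it never raises it above the cap.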
enforce_min_max_values: Add this argument to allow the transformer to learn the min and max allowed values from the data.
(default) | Do not learn any min or max values from the dataset. When reverse transforming the data, the values may be above or below what was originally present. |
| Learn the min and max values from the input data. When reverse transforming the data, any out-of-bounds values will be clipped to the min or max value. |
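The clipping behavior can be sketched in plain Python. The function names are hypothetical and the code only illustrates the described effect on reverse-transformed values.

```python
def learn_bounds(values):
    """Record the smallest and largest values seen in the input data."""
    return min(values), max(values)


def clip_to_bounds(values, low, high):
    """Clip any out-of-bounds value to the learned min or max."""
    return [min(max(v, low), high) for v in values]


low, high = learn_bounds([18, 25, 40])
clipped = clip_to_bounds([10, 30, 55], low, high)
```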
learn_rounding_scheme: Add this argument to allow the transformer to learn about rounded values in your dataset.
(default) | Do not learn or enforce any rounding scheme. When reverse transforming the data, there may be many decimal places present. |
| Learn the rounding rules from the input data. When reverse transforming the data, round the number of digits to match the original. |
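One way to picture a learned rounding scheme is to record the largest number of decimal places in the input and round reverse-transformed values to match. The sketch below makes that assumption explicit; the function names are hypothetical and the library may determine the rounding differently.

```python
from decimal import Decimal


def learn_decimal_places(values):
    """Return the largest number of decimal places used by any value."""
    places = []
    for v in values:
        # normalize() strips trailing zeros, so 3.0 counts as 0 places.
        exponent = Decimal(str(v)).normalize().as_tuple().exponent
        places.append(max(0, -exponent))
    return max(places)


def apply_rounding(values, decimal_places):
    """Round every value to the learned number of decimal places."""
    return [round(v, decimal_places) for v in values]


decimal_places = learn_decimal_places([1.5, 2.25, 3.0])
rounded = apply_rounding([3.14159, 2.0], decimal_places)
```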