ClusterBasedNormalizer performs a statistical transformation on numerical data. It approximates the overall column as a mixture of different shapes (components). Then, it normalizes the values and clusters them into the closest component.
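The idea of "normalize each value and assign it to its closest component" can be illustrated with a small standalone sketch. This is not the library's implementation: the two components below are hand-written assumptions (a real transformer estimates them from the data), the components are treated as equally weighted, and the helper names `gaussian_pdf` and `cluster_and_normalize` are hypothetical.

```python
import math

# Hypothetical components as (mean, standard deviation) pairs. In practice
# these would be estimated from the column rather than written by hand.
COMPONENTS = [(10.0, 2.0), (50.0, 5.0)]


def gaussian_pdf(value, mean, std):
    """Density of a normal distribution at `value`."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def cluster_and_normalize(value, components=COMPONENTS):
    """Return (normalized_value, component_index) for one number."""
    # Pick the component under which the value is most likely.
    densities = [gaussian_pdf(value, m, s) for m, s in components]
    index = densities.index(max(densities))
    mean, std = components[index]
    # Express the value as its distance from that component's mean,
    # measured in standard deviations.
    return (value - mean) / std, index


for v in [9.0, 12.0, 47.5, 55.0]:
    normalized, component = cluster_and_normalize(v)
    print(f"{v:>5} -> component {component}, normalized {normalized:+.2f}")
```

Each original value becomes a pair: a small normalized number plus the index of the component it belongs to. The exact scaling used by the transformer may differ from the plain standard-deviation scaling shown here.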
from rdt.transformers.numerical import ClusterBasedNormalizer
cbe = ClusterBasedNormalizer()
missing_value_replacement: Add this argument to replace missing values during the transform phase
model_missing_values: Add this argument to create another column describing whether the values are missing
max_clusters: The maximum number of components to create when estimating the shape of the overall column.
weight_threshold: The minimum weight that is needed to possibly form a new component. Note that the total number of components is still capped by the max_clusters parameter.
enforce_min_max_values: Add this argument to allow the transformer to learn the min and max allowed values from the data.
learn_rounding_scheme: Add this argument to allow the transformer to learn about rounded values in your dataset.
Changing the max_clusters and weight_threshold values affects both the accuracy and the running time of the transformation.
- Setting a high number of max_clusters or a low weight_threshold will create more components. This may make the clustering more accurate but will take more time. (In the extreme case, it may over-fit to the original data.)
- Setting a lower number of max_clusters or a high weight_threshold will create fewer components. This may make the clustering less accurate but will reduce the time it takes for the algorithm to complete.
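The interaction between the two parameters can be sketched in plain Python. The snippet below is an illustration under stated assumptions, not the transformer's actual code: the weights are made-up mixture weights, the function `select_components` is hypothetical, and its default values are chosen only for the example.

```python
def select_components(weights, max_clusters=10, weight_threshold=0.005):
    """Return the indices of the components that would be kept.

    `weights` are hypothetical mixture weights (summing to 1), one per
    candidate component, as a fitting step might produce.
    """
    # Consider at most `max_clusters` candidates, heaviest first.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    candidates = ranked[:max_clusters]
    # Drop candidates whose weight does not exceed the threshold.
    return sorted(i for i in candidates if weights[i] > weight_threshold)


weights = [0.55, 0.30, 0.10, 0.04, 0.008, 0.002]

print(select_components(weights))                         # illustrative defaults
print(select_components(weights, weight_threshold=0.05))  # stricter threshold
print(select_components(weights, max_clusters=2))         # lower cap
```

Raising the threshold or lowering the cap both shrink the set of components, which is the trade-off described in the list above: fewer components means less work but a coarser approximation of the column.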
When setting the model_missing_values parameter, consider whether the "missingness" of the data is something important. For example, maybe the user opted out of supplying the info on purpose, or maybe a missing value is highly correlated with another column in your dataset. If "missingness" is something you want to account for, you should model missing values.
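To make "modeling missing values" concrete, here is a minimal sketch of the general idea: fill the gaps so the numbers can be transformed, and keep a separate indicator recording where the gaps were. It assumes missing entries are represented by None and that the replacement is the mean of the observed values; the function name `replace_and_flag_missing` is hypothetical and the library's own output format may differ.

```python
from statistics import mean


def replace_and_flag_missing(values):
    """Fill missing entries with the mean and flag where they were.

    Returns (filled_values, is_missing_flags). Missing entries are
    represented by None in this sketch.
    """
    observed = [v for v in values if v is not None]
    fill_value = mean(observed)
    filled = [fill_value if v is None else v for v in values]
    flags = [1.0 if v is None else 0.0 for v in values]
    return filled, flags


ages = [30, None, 50, 40, None]
filled, is_missing = replace_and_flag_missing(ages)
print(filled)      # gaps replaced by the mean of the observed values
print(is_missing)  # 1.0 where the original value was missing
```

Without the indicator, the filled values are indistinguishable from real ones, so any pattern in which rows were missing is lost. Keeping the flag preserves that pattern at the cost of an extra column.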