
ClusterBasedNormalizer

Compatibility: numerical data
The ClusterBasedNormalizer performs a statistical transformation on numerical data. It approximates the overall column as a mixture of different shapes (components). Then, it normalizes the values and clusters them into the closest component.
from rdt.transformers.numerical import ClusterBasedNormalizer
cbe = ClusterBasedNormalizer()
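
To make the idea of "normalizing a value within its closest component" more concrete, here is a minimal, standard-library-only sketch. It is not RDT's implementation: the component weights, means, and standard deviations below are made up (the real transformer learns them from your column), and the helper names `normalize` and `denormalize` as well as the plain z-score scaling are assumptions chosen for illustration.

```python
import math

# Hypothetical mixture components: (weight, mean, standard deviation).
# The real transformer learns these from the column during fitting.
COMPONENTS = [
    (0.6, 10.0, 2.0),
    (0.4, 50.0, 5.0),
]


def gaussian_density(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def normalize(value):
    """Return (normalized value, component index) for one number."""
    # Pick the component under which the value is most likely.
    scores = [w * gaussian_density(value, m, s) for w, m, s in COMPONENTS]
    index = scores.index(max(scores))
    _, mean, std = COMPONENTS[index]
    # Express the value relative to that component's center and spread.
    # (Assumption: a plain z-score; the library's exact scaling may differ.)
    return (value - mean) / std, index


def denormalize(normalized, index):
    """Invert normalize() using the stored component index."""
    _, mean, std = COMPONENTS[index]
    return normalized * std + mean


normalized, component = normalize(12.0)
restored = denormalize(normalized, component)
```

The point of the sketch is that each original number becomes a pair (a normalized value plus the component it belongs to), and that pair is enough to recover the original number.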

Parameters

missing_value_replacement: Add this argument to replace missing values during the transform phase
(default) 'mean'
Replace all missing values with the average value.
'mode'
Replace all missing values with the most frequently occurring value.
<number>
Replace all missing values with the specified number (0, -1, 0.5, etc.)
model_missing_values: Add this argument to create another column describing whether the values are missing
(default) False
Do not create a new column. During the reverse transform, missing values are added in again randomly.
True
Create a new column (if there are missing values). This allows you to keep track of the missing values.
Setting this value to True may add another column to your dataset. Adding extra columns uses more memory and increases the RDT processing time.
max_clusters: The maximum number of components to create when estimating the shape of the overall column.
(default) 10
Cap the number of clusters at 10.
<number>
Cap the number of clusters at the specified value (e.g. 5, 20, etc.). This must be a whole number (integer).
weight_threshold: The minimum weight that is needed to possibly form a new component. Note that the total number of components is still capped by the max_clusters argument above.
(default) 0.005
Create a new component when the weight is 0.005 or above.
<number>
Create a new component when the weight is at or above <number> (e.g. 0.001, 0.01, etc.)
enforce_min_max_values: Add this argument to allow the transformer to learn the min and max allowed values from the data.
(default) False
Do not learn any min or max values from the dataset. When reverse transforming the data, the values may be above or below what was originally present.
True
Learn the min and max values from the input data. When reverse transforming the data, any out-of-bounds values will be clipped to the min or max value.
learn_rounding_scheme: Add this argument to allow the transformer to learn about rounded values in your dataset.
(default) False
Do not learn or enforce any rounding scheme. When reverse transforming the data, there may be many decimal places present.
True
Learn the rounding rules from the input data. When reverse transforming the data, round the number of digits to match the original.

FAQ

Your decision to use this transformer is based on how you plan to use the transformed data. For example, some data science algorithms work better on normalized data. If you're planning to use such an algorithm, this transformer might be a good pre-processing step.
This transformer uses Bayesian Gaussian Mixture Models. Read more about them here.
Changing max_clusters and weight_threshold affects the accuracy and performance of the transformation.
  • Setting a high max_clusters or a low weight_threshold will create more components. This may make the clustering more accurate but will take more time. (In the extreme case, it may over-fit to the original data.)
  • Setting a low max_clusters or a high weight_threshold will create fewer components. This may make the clustering less accurate but will reduce the time it takes for the algorithm to complete.
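
The effect of weight_threshold can be pictured with a small sketch. The weights below are invented, standing in for what a mixture model fitted with at most max_clusters components might report, and `select_components` is a hypothetical helper, not an RDT function. It only illustrates the rule described in the parameter list: a component is kept when its weight is at or above the threshold.

```python
def select_components(weights, weight_threshold=0.005):
    """Return the indices of components whose weight meets the threshold."""
    return [i for i, w in enumerate(weights) if w >= weight_threshold]


# Hypothetical weights for five candidate components (they sum to 1).
weights = [0.55, 0.30, 0.146, 0.003, 0.001]

kept_default = select_components(weights)                       # low threshold
kept_strict = select_components(weights, weight_threshold=0.2)  # high threshold
```

With the default threshold, three of the five candidates survive; raising the threshold to 0.2 leaves only two, which is the "fewer components, faster but coarser" trade-off described above.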
Setting enforce_min_max_values or learn_rounding_scheme to True will enforce the min/max values or the rounding scheme when reverse transforming your data. Use these parameters if you want to recover data in the same format as the original.
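
As a rough illustration of what enforcing bounds and rounding means, the sketch below learns a min, a max, and a number of decimal places from some original values, then applies them to raw reverse-transformed numbers. It uses only the Python standard library; the function names and the digit-detection approach are assumptions for illustration and are not taken from RDT's source.

```python
def learn_bounds(values):
    """Learn the smallest and largest values seen in the original data."""
    return min(values), max(values)


def learn_decimal_places(values, max_places=15):
    """Find the fewest decimal places that represent every value exactly."""
    for places in range(max_places + 1):
        if all(round(v, places) == v for v in values):
            return places
    return None  # no consistent rounding scheme detected


def enforce(value, bounds=None, places=None):
    """Clip to the learned bounds, then round to the learned precision."""
    if bounds is not None:
        low, high = bounds
        value = min(max(value, low), high)
    if places is not None:
        value = round(value, places)
    return value


original = [1.25, 3.5, 10.0]
bounds = learn_bounds(original)
places = learn_decimal_places(original)

# Hypothetical raw values, as they might look before any enforcement.
raw = [0.91734, 4.66666, 10.48213]
restored = [enforce(v, bounds, places) for v in raw]
```

Here the out-of-range values are clipped to the learned min and max, and the in-range value is rounded to two decimal places to match the original data.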
When setting the model_missing_values parameter, consider whether the "missingness" of the data is something important. For example, maybe the user opted out of supplying the info on purpose, or maybe a missing value is highly correlated with another column in your dataset. If "missingness" is something you want to account for, you should model missing values.
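
The following sketch shows, in plain Python, what the two missing-value settings amount to conceptually. `replace_missing` is a hypothetical helper rather than part of RDT, `None` stands in for a missing value, and the 0.0/1.0 encoding of the extra column is an assumption made for the example.

```python
from statistics import mean, mode


def replace_missing(values, missing_value_replacement="mean",
                    model_missing_values=False):
    """Fill missing entries and optionally return a missingness column."""
    present = [v for v in values if v is not None]
    if missing_value_replacement == "mean":
        fill = mean(present)
    elif missing_value_replacement == "mode":
        fill = mode(present)
    else:
        # Any other setting is treated as the literal number to fill in.
        fill = missing_value_replacement
    filled = [fill if v is None else v for v in values]

    # Only create the extra column when asked to and when something is missing.
    if model_missing_values and len(present) < len(values):
        is_null = [1.0 if v is None else 0.0 for v in values]
        return filled, is_null
    return filled, None


ages = [30, None, 40, 30, None]
filled, is_null = replace_missing(ages, "mode", model_missing_values=True)
filled_only, no_column = replace_missing(ages, -1)
```

With the extra column, the positions of the missing values are preserved and can be restored exactly; without it, that information is gone after the fill, which is why missing values can only be re-added randomly in that case.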