ClusterBasedNormalizer
Compatibility: numerical data
The ClusterBasedNormalizer performs a statistical transformation on numerical data. It approximates the overall column as a mixture of different shapes (components). Then, it normalizes the values and clusters them into the closest component.
from rdt.transformers.numerical import ClusterBasedNormalizer
transformer = ClusterBasedNormalizer()
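To make the normalize-and-cluster idea concrete, here is a minimal standard-library sketch. It assumes the mixture components (means and standard deviations) have already been estimated. The names `components`, `normalize_to_closest` and `denormalize` are hypothetical, and picking the single most likely component is a simplification for illustration, not RDT's actual implementation.

```python
import math

# Hypothetical, already-estimated components: (mean, standard deviation).
components = [(10.0, 2.0), (50.0, 5.0)]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean/std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def normalize_to_closest(x, components):
    """Return (normalized value, index of the most likely component)."""
    index = max(
        range(len(components)),
        key=lambda i: gaussian_pdf(x, *components[i]),
    )
    mean, std = components[index]
    return (x - mean) / std, index


def denormalize(normalized, index, components):
    """Invert normalize_to_closest for a known component index."""
    mean, std = components[index]
    return normalized * std + mean
```

In this sketch each value becomes a pair (normalized value, component index), and the pair is enough to recover the original number.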
missing_value_replacement: Add this argument to replace missing values during the transform phase.
(default) 'random' | Replace missing values with a random value. The value is chosen uniformly at random from the min/max range.
'mean' | Replace all missing values with the average value.
'mode' | Replace all missing values with the most frequently occurring value.
<number> | Replace all missing values with the specified number (0, -1, 0.5, etc.).
None | Deprecated. Do not replace missing values. The transformed data will continue to have missing values.
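The replacement options can be illustrated with a small standard-library sketch. The function `replace_missing` and its `seed` argument are hypothetical and only mimic the behavior described in the table; missing values are represented as `None` here, which is an assumption of the sketch and not how RDT stores them.

```python
import random
import statistics


def replace_missing(values, replacement="random", seed=None):
    """Return a copy of values with each None filled per the chosen option."""
    present = [v for v in values if v is not None]
    if replacement == "random":
        # Draw uniformly from the min/max range of the observed values.
        rng = random.Random(seed)
        low, high = min(present), max(present)
        return [rng.uniform(low, high) if v is None else v for v in values]
    if replacement == "mean":
        fill = statistics.mean(present)
    elif replacement == "mode":
        fill = statistics.mode(present)
    else:
        fill = replacement  # a specific number such as 0, -1 or 0.5
    return [fill if v is None else v for v in values]
```

For example, `replace_missing([1.0, None, 3.0, 3.0, None], "mode")` fills both gaps with `3.0`, the most frequent observed value.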
(deprecated) model_missing_values: Use the missing_value_generation parameter instead.
missing_value_generation: Add this argument to determine how to recreate missing values during the reverse transform phase.
(default) 'random' | Randomly assign missing values in roughly the same proportion as the original data.
'from_column' | Create a new column to store whether the value should be missing. Use it to recreate missing values. Note: Adding extra columns uses more memory and increases the RDT processing time.
None | Do not recreate missing values.
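The three generation options differ in where the "missing" decision comes from. The sketch below is a standard-library illustration under stated assumptions: `recreate_missing`, `missing_ratio`, `is_missing` and `seed` are hypothetical names, and `None` stands in for a missing value.

```python
import random


def recreate_missing(values, strategy="random", missing_ratio=0.0,
                     is_missing=None, seed=None):
    """Return a copy of values with missing entries recreated as None."""
    if strategy is None:
        # Do not recreate missing values at all.
        return list(values)
    if strategy == "from_column":
        # Use the extra indicator column to decide which rows are missing.
        return [None if flag else v for v, flag in zip(values, is_missing)]
    # 'random': blank out rows at roughly the original missing proportion.
    rng = random.Random(seed)
    return [None if rng.random() < missing_ratio else v for v in values]
```

With 'from_column' the missing rows are reproduced exactly where the indicator says, at the cost of carrying one more column; with 'random' only the overall proportion is preserved.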
max_clusters: The maximum number of components to create when estimating the shape of the overall column.
(default) 10 | Cap the number of clusters to 10.
<number> | Cap the number of clusters to the specified value (e.g. 5, 20, etc.). This must be a whole number (integer).
weight_threshold: The minimum weight that is needed to possibly form a new component. Note that the total number of components is still capped by the max_clusters argument above.
(default) 0.005 | Create a new component when the weight is 0.005 or above.
<number> | Create a new component when the weight is at or above <number> (e.g. 0.001, 0.01, etc.).
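The interaction between the two settings can be sketched as a simple filter: at most max_clusters candidate components are considered, and only those whose weight reaches the threshold are kept. The weights below are made-up numbers and `kept_components` is a hypothetical helper, so treat this as an illustration of the rule in the table and not as RDT's code.

```python
max_clusters = 5

# Hypothetical weights for the 5 candidate components (they sum to 1).
weights = [0.60, 0.30, 0.097, 0.002, 0.001]


def kept_components(weights, weight_threshold=0.005):
    """Indices of candidate components whose weight is at or above the threshold."""
    return [i for i, w in enumerate(weights) if w >= weight_threshold]
```

With the default threshold, `kept_components(weights)` keeps the first three candidates. Lowering the threshold to 0.001 keeps all five, and raising it to 0.1 keeps only the two dominant ones.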
enforce_min_max_values: Add this argument to allow the transformer to learn the min and max allowed values from the data.
(default) False | Do not learn any min or max values from the dataset. When reverse transforming the data, the values may be above or below what was originally present.
True | Learn the min and max values from the input data. When reverse transforming the data, any out-of-bounds values will be clipped to the min or max value.
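The clipping described for True amounts to the following standard-library sketch. `clip_to_learned_range` is a hypothetical helper written for illustration only.

```python
def clip_to_learned_range(original, generated):
    """Clip generated values to the min/max observed in the original data."""
    low, high = min(original), max(original)
    return [min(max(v, low), high) for v in generated]
```

For instance, if the original data spans 2.0 to 9.0, a reverse-transformed value of 10.2 would come back as 9.0 and a value of 1.5 as 2.0, while in-range values are left untouched.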
learn_rounding_scheme: Add this argument to allow the transformer to learn about rounded values in your dataset.
(default) False | Do not learn or enforce any rounding scheme. When reverse transforming the data, there may be many decimal places present.
True | Learn the rounding rules from the input data. When reverse transforming the data, round the number of digits to match the original.
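One simple way to picture a rounding scheme is "the largest number of decimal digits seen in the input". The sketch below takes that view; it is an assumption made for illustration (including the helper names `learn_decimal_places` and `round_like_original`), it does not handle NaN or infinite values, and RDT's own rule may differ.

```python
from decimal import Decimal


def learn_decimal_places(values):
    """Largest number of decimal digits used by any value in the data."""
    return max(
        max(-Decimal(str(v)).normalize().as_tuple().exponent, 0)
        for v in values
    )


def round_like_original(original, generated):
    """Round generated values to the decimal places learned from original."""
    decimals = learn_decimal_places(original)
    return [round(v, decimals) for v in generated]
```

If the original column holds values such as 1.25, 3.0 and 7.5, the learned precision is two decimal places, so a generated 2.34567 would be reported as 2.35.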
Changing max_clusters and weight_threshold affects the accuracy and performance of the transformation.
- Setting a high number of max_clusters or a low weight_threshold will create more components. This may make the clustering more accurate but will take more time. (In the extreme case, it may over-fit to the original data.)
- Setting a lower number of max_clusters or a high weight_threshold will create fewer components. This may make clustering less accurate but will reduce the time it takes for the algorithm to complete.
When setting the missing_value_generation parameter (which replaces the deprecated model_missing_values), consider whether the "missingness" of the data is something important. For example, maybe the user opted out of supplying the info on purpose, or maybe a missing value is highly correlated with another column in your dataset. If "missingness" is something you want to account for, you should model missing values.
Last modified 29d ago