❖ ECDFNormalizer
Compatibility: numerical data
The ECDFNormalizer normalizes your data into a uniform or normal shape. To do this, it estimates the empirical cumulative distribution function (CDF) of your data. (On the reverse transform, this transformer brings the data back into its original shape.)
known_min_value: A previously known minimum value for the data. This determines the minimum possible value the transformer can accept.
(default) None
There is no known minimum value for the data. The transformer will compute one based on the fit data.
<float>
The transformer will make sure the data is never less than this value.
known_max_value: A previously known maximum value for the data. This determines the maximum possible value the transformer can accept.
(default) None
There is no known maximum value for the data. The transformer will compute one based on the fit data.
<float>
The transformer will make sure the data is never greater than this value.
n_bins: This parameter controls the number of bins to divide the data into when computing the empirical distribution. You can think of this as the number of bars a histogram of the data would have.
(default) 25
Divide the data into 25 bins when computing the empirical distribution.
<int>
Divide the data into the provided number of bins.
missing_range_encoding: How to encode missing ranges (aka histogram bins that have a frequency of 0).
(default) 'exclude'
Bins with a frequency of 0 should not be included in the CDF function. This means that reverse transformed data will never be inside these ranges.
'low_probability'
Bins with a frequency of 0 should be included in the CDF function and assigned a low probability. This means that the reverse transformed data can fall inside missing ranges.
missing_value_encoding: Add this argument to control how to encode missing values in the empirical distribution. Missing values can be binned together and represented as being in either the highest or lowest bin of the histogram.
(default) 'ecdf_low_bin'
Encode the missing values in the empirical CDF as the first (lowest) bin
'ecdf_high_bin'
Encode the missing values in the empirical CDF as the final (highest) bin
normalized_distribution: Add this argument to control the shape of the transformed data. Choose whatever is easiest for your downstream use case.
(default) 'uniform'
Transform the data into a uniform distribution, between 0 and 1.
'norm'
Transform the data into a standard normal distribution, aka a bell curve with mean of 0 and standard deviation of 1.
learn_rounding_scheme: Add this argument to allow the transformer to learn about rounded values in your dataset.
(default) False
Do not learn or enforce any rounding scheme. When reverse transforming the data, there may be many decimal places present.
True
Learn the rounding rules from the input data. When reverse transforming the data, round each value to the same number of decimal digits as the original.
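To picture what the missing_range_encoding and normalized_distribution options listed above change, here is a minimal sketch in plain NumPy. It is not the transformer's actual implementation: the helper names build_cdf and to_normal, the low_weight value, and the clipping constant eps are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def build_cdf(counts, missing_range_encoding="exclude", low_weight=1e-3):
    """Turn histogram counts into cumulative probabilities at the bin edges."""
    weights = np.asarray(counts, dtype=float)
    if missing_range_encoding == "low_probability":
        # Assumption: each empty bin gets a small share of the total count.
        weights = np.where(weights == 0, low_weight * weights.sum(), weights)
    return np.concatenate([[0.0], np.cumsum(weights) / weights.sum()])

def to_normal(u, eps=1e-6):
    """Map uniform values to a standard normal via the inverse normal CDF."""
    # Assumption: clip away exact 0 and 1, where the inverse CDF is infinite.
    clipped = np.clip(u, eps, 1 - eps)
    return np.array([NormalDist().inv_cdf(float(p)) for p in clipped])

# Three bins over [0, 3]; the middle bin [1, 2) is a missing range.
counts = np.array([4, 0, 6])

# 'exclude': the CDF is flat over the empty bin, so inverting it
# never produces a value inside [1, 2).
exclude_cdf = build_cdf(counts, "exclude")

# 'low_probability': the CDF rises slightly over the empty bin, so a
# small fraction of reverse transformed values can land inside it.
low_prob_cdf = build_cdf(counts, "low_probability")

# 'norm': the uniform output is pushed through the inverse normal CDF.
z = to_normal(np.array([0.1, 0.5, 0.9]))
```

In this sketch the only difference between the two missing-range modes is whether the CDF is strictly increasing across the empty bin, which is what decides whether the reverse transform can return values in that range.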
This transformer creates a histogram of your data and uses it to compute an empirical CDF. The empirical CDF can then be used to normalize your data into a different shape (uniform or normal) using the probability integral transform.
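As a rough illustration of the histogram-to-CDF idea described above, the following sketch fits a piecewise-linear empirical CDF with NumPy, uses it to map data to the range 0 to 1, and then inverts it. The function names fit_ecdf, transform, and reverse_transform are hypothetical helpers written for this example, and the linear interpolation inside each bin is an assumption, not a statement about how the transformer is implemented.

```python
import numpy as np

def fit_ecdf(data, n_bins=25):
    """Build a piecewise-linear empirical CDF from a histogram of the data."""
    counts, edges = np.histogram(data, bins=n_bins)
    # Cumulative share of rows at each bin edge: starts at 0, ends at 1.
    cdf = np.concatenate([[0.0], np.cumsum(counts) / counts.sum()])
    return edges, cdf

def transform(data, edges, cdf):
    """Map values to [0, 1] by reading them off the empirical CDF."""
    return np.interp(data, edges, cdf)

def reverse_transform(uniform, edges, cdf):
    """Map values in [0, 1] back to the original scale by inverting the CDF."""
    return np.interp(uniform, cdf, edges)

# Skewed example data; the seed only makes the run repeatable.
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)

edges, cdf = fit_ecdf(data, n_bins=25)
u = transform(data, edges, cdf)                # roughly uniform on [0, 1]
recovered = reverse_transform(u, edges, cdf)   # back on the original scale
```

Because the forward and reverse steps read the same curve in opposite directions, values that were seen during fitting come back essentially unchanged. More bins make the curve follow the data more closely, at the cost of more empty bins in sparse regions.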