❖ ECDFNormalizer

Compatibility: numerical data

SDV Enterprise Bundle. This feature is available as part of the XSynthesizers Bundle, an optional add-on to SDV Enterprise. For more information, please visit the XSynthesizers Bundle page.

The ECDFNormalizer normalizes your data into a uniform or normal shape. To do this, it estimates the empirical distribution of your data. (On the reverse transform, this transformer brings the data back into its original shape.)

from rdt.transformers.numerical import ECDFNormalizer

transformer = ECDFNormalizer(
    known_min_value=0,
    known_max_value=100,
    normalized_distribution='uniform'
)

Parameters

known_min_value: A previously known minimum value for the data. This determines the lowest value the transformer can accept.

(default) None

There is no known minimum value for the data. The transformer will compute one based on the fit data.

<float>

The transformer will make sure the data is never less than this value.

known_max_value: A previously known maximum value for the data. This determines the highest value the transformer can accept.

(default) None

There is no known maximum value for the data. The transformer will compute one based on the fit data.

<float>

The transformer will make sure the data is never greater than this value.

n_bins: The number of bins to divide the data into when computing the empirical distribution. You can think of this as the number of bars a histogram of the data would have.

(default) 25

Divide up the data into 25 bins when computing the empirical distribution

<int>

Divide up the data into the provided number of bins

missing_range_encoding: How to encode missing ranges (that is, histogram bins that have a frequency of 0).

(default) 'exclude'

Bins with a frequency of 0 should not be included in the CDF function. This means that reverse transformed data will never be inside these ranges.

'low_probability'

Bins with a frequency of 0 should be included in the CDF and assigned a low probability. This means that reverse transformed data can fall inside missing ranges.

missing_value_encoding: Add this argument to control how to encode missing values in the empirical distribution. Missing values can be binned together and represented as being in either the highest or lowest bin of the histogram.

(default) 'ecdf_low_bin'

Encode the missing values in the empirical CDF as the first (lowest) bin

'ecdf_high_bin'

Encode the missing values in the empirical CDF as the final (highest) bin

normalized_distribution: Add this argument to control the shape of the transformed data. Choose whatever is easiest for your downstream use case.

(default) 'uniform'

Transform the data into a uniform distribution, between 0 and 1.

'norm'

Transform the data into a standard normal distribution, aka a bell curve with mean of 0 and standard deviation of 1.

learn_rounding_scheme: Add this argument to allow the transformer to learn about rounded values in your dataset.

(default) False

Do not learn or enforce any rounding scheme. When reverse transforming the data, there may be many decimal places present.

True

Learn the rounding rules from the input data. When reverse transforming the data, round the number of digits to match the original.
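To make the idea of a learned rounding scheme concrete, here is a minimal sketch of one way it could work: find the smallest number of decimal places that reproduces every input value, then round reverse-transformed values to that many digits. The function name `learn_rounding_digits` and the `max_digits` limit are hypothetical and chosen for illustration; this is not the transformer's actual implementation.

```python
import numpy as np


def learn_rounding_digits(values, max_digits=15):
    """Return the fewest decimals that reproduce every value, or None.

    Hypothetical helper for illustration only; not part of the RDT API.
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing values
    for digits in range(max_digits + 1):
        if np.array_equal(values, np.round(values, digits)):
            return digits
    return None  # no consistent rounding scheme found


# Original data is recorded to two decimal places
original = np.array([12.5, 3.25, 40.0, 7.75])
digits = learn_rounding_digits(original)

# Values coming out of a reverse transform may carry extra decimals
noisy = np.array([12.4999731, 3.2500418, 39.9999902])
rounded = np.round(noisy, digits)
```

With learn_rounding_scheme set to False, the equivalent of the final rounding step would simply be skipped, which is why extra decimal places may appear.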

FAQ

Which algorithms does this transformer use?

This transformer creates a histogram of your data and uses it to compute an empirical CDF. The empirical CDF can then be used to normalize your data into a different shape (uniform or normal) using the probability integral transform.
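The general technique can be sketched with NumPy and the standard library. This is an illustrative approximation under stated assumptions, not the transformer's own code: the helper names are hypothetical, the CDF is interpolated linearly between bin edges, and every bin is assumed to be non-empty (handling empty bins is what missing_range_encoding governs). Missing values are not handled.

```python
from statistics import NormalDist

import numpy as np


def fit_ecdf(values, n_bins=25):
    """Histogram the data and return (bin edges, CDF value at each edge)."""
    counts, edges = np.histogram(values, bins=n_bins)
    if np.any(counts == 0):
        # Simplifying assumption of this sketch: no empty bins
        raise ValueError('this sketch does not handle empty bins')
    cdf = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))
    return edges, cdf


def to_uniform(values, edges, cdf):
    """Probability integral transform: pass values through the empirical CDF."""
    return np.interp(values, edges, cdf)


def from_uniform(u, edges, cdf):
    """Invert the piecewise-linear CDF to return to the original scale."""
    return np.interp(u, cdf, edges)


def to_normal(u, eps=1e-6):
    """Map uniform values to a standard normal via the inverse normal CDF."""
    std_normal = NormalDist()
    clipped = np.clip(u, eps, 1 - eps)  # inv_cdf is undefined at exactly 0 and 1
    return np.array([std_normal.inv_cdf(float(p)) for p in clipped])


def from_normal(z):
    """Map standard normal values back to the uniform scale."""
    std_normal = NormalDist()
    return np.array([std_normal.cdf(float(v)) for v in z])


# Deterministic, right-skewed sample in [0, 100]
data = np.linspace(0, 100, 501) ** 2 / 100

edges, cdf = fit_ecdf(data, n_bins=25)
uniform = to_uniform(data, edges, cdf)    # analogous to 'uniform'
normal = to_normal(uniform)               # analogous to 'norm'
recovered = from_uniform(from_normal(normal), edges, cdf)
```

In this sketch the round trip recovers the input up to the small clipping applied at the extremes, which mirrors how the reverse transform brings data back to its original shape.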
