# GaussianNormalizer

**Compatibility:** `numerical` data

{% hint style="warning" %}
To use this transformer, you must install the `copulas` module in addition to `rdt`. This is available in open source for all users.

```
pip install rdt[copulas]
```

{% endhint %}

The `GaussianNormalizer` performs a statistical transformation on numerical data. It approximates the shape of the overall column. Then, it converts the data to a different shape: a standard normal distribution (aka a bell curve with mean = 0 and standard deviation = 1).

![](https://2225246359-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FVGX92M819eIp0rMg5elc%2Fuploads%2F00IJManz7vohZtRuiauy%2Frdt_transformers-glossary-numerical-gaussian-normalizer_June%2002%202025.png?alt=media\&token=b72d98ec-7f94-4df1-a1c6-2ea8e3fcb4a6)

```python
from rdt.transformers.numerical import GaussianNormalizer
transformer = GaussianNormalizer()
```

## Parameters

**`missing_value_replacement`**: Add this argument to replace missing values during the transform phase.

<table data-header-hidden><thead><tr><th width="212"></th><th></th></tr></thead><tbody><tr><td>(default) <code>'mean'</code></td><td>Replace all missing values with the average value.</td></tr><tr><td><code>'random'</code></td><td>Replace missing values with a random value. The value is chosen uniformly at random from the min/max range.</td></tr><tr><td><code>'mode'</code></td><td>Replace all missing values with the most frequently occurring value.</td></tr><tr><td><code>&#x3C;number></code></td><td>Replace all missing values with the specified number (<code>0</code>, <code>-1</code>, <code>0.5</code>, etc.).</td></tr><tr><td><code>None</code></td><td>Do not replace missing values. The transformed data will continue to have missing values.</td></tr></tbody></table>

*(deprecated) `model_missing_values`: Use the `missing_value_generation` parameter instead.*

**`missing_value_generation`**: Add this argument to determine how to recreate missing values during the reverse transform phase.

<table data-header-hidden><thead><tr><th width="203"></th><th></th></tr></thead><tbody><tr><td>(default) <code>'random'</code></td><td>Randomly assign missing values in roughly the same proportion as the original data.</td></tr><tr><td><code>'from_column'</code></td><td>Create a new column to store whether the value should be missing. Use it to recreate missing values. <em>Note: Adding extra columns uses more memory and increases the RDT processing time.</em></td></tr><tr><td><code>None</code></td><td>Do not recreate missing values.</td></tr></tbody></table>

**`distribution`**: In the first step of the normalization, the transformer approximates the shape (aka distribution) of the overall column. Use this parameter to specify the overall shape.

<table data-header-hidden><thead><tr><th width="262.5"></th><th></th></tr></thead><tbody><tr><td>(default) <code>'truncnorm'</code></td><td>Approximate the shape as a truncated normal distribution.</td></tr><tr><td><code>&#x3C;name></code></td><td>Approximate the shape based on the provided distribution type. Possible options are: <code>'norm'</code>, <code>'gamma'</code>, <code>'beta'</code>, <code>'t'</code>, <code>'truncnorm'</code>, <code>'uniform'</code> and <code>'gaussian_kde'</code>.<br><br><em>Deprecated: <code>'gaussian'</code>, <code>'truncated_gaussian'</code> and <code>'student_t'</code>. Instead, please use the names <code>'norm'</code>, <code>'truncnorm'</code> and <code>'t'</code> (respectively).</em></td></tr><tr><td><code>&#x3C;copulas.univariate.Univariate></code></td><td>Use the Univariate object created from the Copulas library. See the <a href="https://sdv.dev/Copulas/tutorials/02_Univariate_Distributions.html">User Guide</a> for more information.</td></tr></tbody></table>

**`enforce_min_max_values`**: Add this argument to allow the transformer to learn the min and max allowed values from the data.

<table data-header-hidden><thead><tr><th width="270.5"></th><th></th></tr></thead><tbody><tr><td>(default) <code>False</code></td><td>Do not learn any min or max values from the dataset. When reverse transforming the data, the values may be above or below what was originally present.</td></tr><tr><td><code>True</code></td><td>Learn the min and max values from the input data. When reverse transforming the data, any out-of-bounds values will be clipped to the min or max value.</td></tr></tbody></table>

**`learn_rounding_scheme`**: Add this argument to allow the transformer to learn about rounded values in your dataset.

<table data-header-hidden><thead><tr><th width="249.5"></th><th></th></tr></thead><tbody><tr><td>(default) <code>False</code></td><td>Do not learn or enforce any rounding scheme. When reverse transforming the data, there may be many decimal places present.</td></tr><tr><td><code>True</code></td><td>Learn the rounding rules from the input data. When reverse transforming the data, round the number of digits to match the original.</td></tr></tbody></table>

## FAQ

<details>

<summary>When should I use this transformer?</summary>

Whether to use this transformer depends on how you plan to use the transformed data. For example, algorithms such as the [Gaussian Copula](https://en.wikipedia.org/wiki/Copula_\(probability_theory\)#Gaussian_copula) require normalized data. If you're planning to use such an algorithm, this transformer might be a good pre-processing step.

</details>

<details>

<summary>Which algorithm does this transformer use to normalize the data?</summary>

This transformer uses a [Probability Integral Transform](https://en.wikipedia.org/wiki/Probability_integral_transform) to transform the original data into a uniform distribution. From there, it converts the data to a standard normal (Gaussian) distribution.
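
The two steps can be sketched in plain Python. This is only an illustration of the idea, not the transformer's implementation: the function name `gaussian_normalize` is hypothetical, and it assumes the column is approximated with a plain normal distribution (the transformer's default is `'truncnorm'`).

```python
from statistics import NormalDist, mean, stdev

def gaussian_normalize(values):
    """Map values to standard normal space via the probability integral transform."""
    # Step 1: approximate the column's shape. A plain normal fit is an
    # assumption made for brevity; it is not the transformer's default.
    fitted = NormalDist(mean(values), stdev(values))
    standard_normal = NormalDist(0, 1)

    normalized = []
    for value in values:
        # Step 2: the fitted CDF maps each value to a probability in (0, 1),
        # which is uniformly distributed if the fit is good.
        probability = fitted.cdf(value)
        # Keep probabilities strictly inside (0, 1) so inv_cdf is defined.
        probability = min(max(probability, 1e-9), 1 - 1e-9)
        # Step 3: the inverse standard normal CDF maps it to a z-score.
        normalized.append(standard_normal.inv_cdf(probability))
    return normalized

ages = [23, 31, 35, 38, 42, 47, 55, 61]
normalized_ages = gaussian_normalize(ages)
```

With a normal fit, the result reduces to ordinary z-scores. The two-step route matters when the fitted distribution is skewed or bounded, because the CDF step absorbs the original shape before the data is mapped onto the bell curve.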

</details>

<details>

<summary>Can you define the mathematical terms?</summary>

Below are some definitions for the mathematical terms we've used in this doc.

* A **distribution** is a mathematical formula that describes the overall shape of data. A distribution has parameters that precisely describe it. For example, a bell curve is a distribution with parameters for mean and standard deviation.
* A **parametric distribution** is a distribution that has a preset number of parameters with specific meanings. For example, a bell curve is a parametric distribution because we know it has 2 parameters (mean and standard deviation).
* A **Gaussian** (or **normal**) distribution is a bell curve. The **standard normal** distribution is the specific bell curve with mean = 0 and standard deviation = 1. Other distribution names such as **gamma**, **beta** and **Student's t** also have precise meanings. Refer to [this list of probability distributions](https://en.wikipedia.org/wiki/List_of_probability_distributions) for more info.

</details>

<details>

<summary>How does the <code>distribution</code> parameter affect the transformation?</summary>

The `GaussianNormalizer` approximates the column's shape (aka distribution) by searching through multiple options. The closer the approximation is to the column's real shape, the more closely the transformed data follows a standard normal distribution. However, there is a tradeoff between accuracy and transformation time.

* Searching through more distributions takes longer but leads to greater accuracy. To save time, you can input a specific distribution if you already know the shape of the column.
* Searching through parametric distributions is faster than searching through non-parametric distributions, but can have lower accuracy. For the highest accuracy (which also takes the longest), use the non-parametric `gaussian_kde` distribution.
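
To make the tradeoff concrete, here is a minimal sketch of what searching through candidate distributions could look like. Everything in it is an assumption for illustration: the helper names `ks_statistic` and `select_distribution` are hypothetical, only two candidates are tried, and scoring with the Kolmogorov-Smirnov statistic is one plausible choice rather than a description of what the library does.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(values, cdf):
    """Largest gap between the empirical CDF and a candidate CDF."""
    ordered = sorted(values)
    n = len(ordered)
    gaps = []
    for i, value in enumerate(ordered):
        model = cdf(value)
        # Compare against the empirical CDF just before and just after the step.
        gaps.append(max(abs(model - i / n), abs((i + 1) / n - model)))
    return max(gaps)

def select_distribution(values):
    """Return the name of the best-fitting candidate, plus every score."""
    low, high = min(values), max(values)
    normal = NormalDist(mean(values), stdev(values))
    # Each extra candidate adds one more fit and one more scoring pass,
    # which is why a wider search takes longer.
    candidates = {
        'norm': normal.cdf,
        'uniform': lambda x: (x - low) / (high - low),
    }
    scores = {name: ks_statistic(values, cdf) for name, cdf in candidates.items()}
    return min(scores, key=scores.get), scores

evenly_spread = [float(i) for i in range(1, 101)]
best_name, scores = select_distribution(evenly_spread)
```

For evenly spread data the uniform candidate scores better than the normal one. If you already know that about your column, passing the distribution name directly skips the search.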

</details>

<details>

<summary>When are the min/max and rounding schemes enforced?</summary>

Using these options will enforce the min/max values or rounding scheme when reverse transforming your data. Use these parameters if you want to recover data in the same format as the original.
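
As a rough sketch of the idea (not the library's code), the example below learns the bounds and decimal precision from the original values, then applies them to hypothetical reverse-transformed output. The helper names and the `raw` values are made up for illustration.

```python
def learn_bounds_and_rounding(values, max_digits=15):
    """Learn the min, max and number of decimal places present in the data."""
    digits = 0
    # Find the smallest number of decimals that reproduces every value.
    while digits < max_digits and any(round(v, digits) != v for v in values):
        digits += 1
    return min(values), max(values), digits

def enforce(values, low, high, digits):
    """Clip to the learned range, then round to the learned precision."""
    return [round(min(max(v, low), high), digits) for v in values]

original = [10.25, 12.5, 19.75, 14.0]
low, high, digits = learn_bounds_and_rounding(original)

# Hypothetical raw output of a reverse transform, before enforcement.
raw = [9.8731, 13.33333, 21.0442]
recovered = enforce(raw, low, high, digits)
```

Here the out-of-range values are pulled back to the learned min and max, and every value is rounded to two decimal places to match the original data.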

</details>

<details>

<summary>When is it necessary to model missing values?</summary>

When setting the `missing_value_generation` parameter, consider whether the "missingness" of the data is something important. For example, maybe the user opted out of supplying the info on purpose, or maybe a missing value is highly correlated with another column in your dataset. If "missingness" is something you want to account for, you should model missing values.
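
The difference between the options can be sketched in plain Python. This is an illustration under assumptions, not the transformer's implementation: the helper names are hypothetical, missing values are represented as `None`, and the replacement strategy shown is `'mean'`.

```python
import random
from statistics import mean

def transform_missing(values):
    """Fill gaps with the mean and record where they were ('from_column' style)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    # Extra column: 1.0 where the original value was missing, else 0.0.
    is_missing = [1.0 if v is None else 0.0 for v in values]
    return filled, is_missing

def restore_from_column(filled, is_missing):
    """Put gaps back exactly where the indicator column says they belong."""
    return [None if flag >= 0.5 else v for v, flag in zip(filled, is_missing)]

def restore_random(filled, missing_rate, seed=0):
    """Put gaps back at random, in roughly the original proportion."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else v for v in filled]

income = [52000.0, None, 61000.0, 48000.0, None, 75000.0]
filled, is_missing = transform_missing(income)
restored = restore_from_column(filled, is_missing)
```

The indicator column keeps the position of each gap, so anything downstream can learn how missingness relates to other columns. The random approach only preserves the overall proportion, which is cheaper but discards that relationship.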

</details>
