* OutlierEncoder


Last updated 6 months ago

Compatibility: numerical data

The OutlierEncoder identifies outliers to the left and right of the main data and encodes this information in a new column. It then removes the outliers from the original column, which makes the column easier to use in downstream data science work.

from rdt.transformers.numerical import OutlierEncoder

transformer = OutlierEncoder()
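
To make the idea concrete, here is a minimal standard-library sketch of the kind of split described above. It is not the OutlierEncoder implementation: the 1.5 × IQR cutoffs, the helper name `split_outliers`, the label strings and the choice to replace removed outliers with None are all assumptions made for illustration.

```python
import statistics


def split_outliers(values, k=1.5):
    """Return (cleaned, labels) for a list of numbers.

    Assumption: a value is an outlier if it lies below Q1 - k*IQR
    or above Q3 + k*IQR (the conventional box-plot rule).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    cleaned, labels = [], []
    for value in values:
        if value < lower:
            labels.append("LEFT_OUTLIER")
            cleaned.append(None)  # hypothetical: outlier removed from the column
        elif value > upper:
            labels.append("RIGHT_OUTLIER")
            cleaned.append(None)  # hypothetical: outlier removed from the column
        else:
            labels.append("MAIN")
            cleaned.append(value)
    return cleaned, labels


cleaned, labels = split_outliers([0.0, 5.0, 10.0, 25.0, 10000.0])
```

Here `labels` plays the role of the new column and `cleaned` is the original column with its outliers taken out.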

Parameters

distribution: The transformer approximates the shape (aka distribution) of the main values as well as the outliers. Use this parameter to specify the shape.

(default) 'uniform'

Estimate the main values and outliers using uniform distributions.

'truncnorm'

Estimate the main values and outliers using a truncated Gaussian distribution.

Attributes

After fitting the transformer, you can access the learned values through the attributes.

box_plot_summary: A dictionary that stores the min, max and quartile values for the overall column.

>>> transformer.box_plot_summary
{
  'min': 0.0,
  'Q1': 5.0,
  'Q2': 10.50,
  'Q3': 25.0,
  'max': 10000.0
}

>>> transformer.iqr  # interquartile range: Q3 - Q1
20.0
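
As a rough illustration of what these two attributes hold, the sketch below computes a five-number summary and the IQR with Python's standard library. The helper name `box_plot_summary` and the sample data are hypothetical, and the quantile interpolation method (`"inclusive"`) is an assumption; the transformer may compute its quartiles differently.

```python
import statistics


def box_plot_summary(values):
    """Five-number summary of a numeric column (illustrative only)."""
    # Assumption: quartiles use the "inclusive" (linear interpolation) method.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "Q1": q1, "Q2": q2, "Q3": q3, "max": max(values)}


summary = box_plot_summary([0.0, 5.0, 10.0, 25.0, 10000.0])
iqr = summary["Q3"] - summary["Q1"]  # interquartile range
```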

outlier_ranges: A dictionary that maps 'left_outliers' to the left outlier range and 'right_outliers' to the right outlier range. Either value may be None if there are no outliers on that side.

>>> transformer.outlier_ranges
{
  'left_outliers': None,
  'right_outliers': [55.0, 10000.0]
}
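
The example values on this page are consistent with the conventional 1.5 × IQR box-plot rule: Q3 + 1.5 × IQR = 25.0 + 1.5 × 20.0 = 55.0, while Q1 - 1.5 × IQR = -25.0 lies below the minimum of 0.0, so there are no left outliers. This page does not state which rule the transformer uses, so the sketch below, including the helper name `outlier_ranges` and the factor `k`, should be read as an assumption rather than the actual implementation.

```python
def outlier_ranges(summary, k=1.5):
    """Derive outlier ranges from a box-plot summary (illustrative only)."""
    iqr = summary["Q3"] - summary["Q1"]
    lower = summary["Q1"] - k * iqr  # assumed left cutoff
    upper = summary["Q3"] + k * iqr  # assumed right cutoff
    return {
        # A side has a range only if the data extends past its cutoff.
        "left_outliers": [summary["min"], lower] if summary["min"] < lower else None,
        "right_outliers": [upper, summary["max"]] if summary["max"] > upper else None,
    }


summary = {"min": 0.0, "Q1": 5.0, "Q2": 10.5, "Q3": 25.0, "max": 10000.0}
ranges = outlier_ranges(summary)
```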

learned_distributions: A dictionary that maps 'left_outliers', 'main' and 'right_outliers' to the learned distribution for each area. These may be None if there are no values in the area.

>>> transformer.learned_distributions
{
  'LEFT_OUTLIER': None,
  'MAIN': { 
    'distribution': 'uniform',
    'learned_parameters': { 'scale': 1.2, 'loc': 25.0 },
  },
  'RIGHT_OUTLIER': { 
    'distribution': 'uniform',
    'learned_parameters': { 'scale': 1.2, 'loc': 40.0 }
  }
}
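
One way to read the 'loc' and 'scale' entries is the convention used by scipy.stats.uniform, where the distribution covers the interval from loc to loc + scale. Assuming that convention applies here (this page does not confirm it), fitting a uniform distribution to one area could be sketched as follows. The helper name `fit_uniform` and the sample values are hypothetical.

```python
def fit_uniform(values):
    """Fit a uniform distribution to the values in one area (illustrative only)."""
    if not values:
        return None  # no values in this area, so nothing to learn
    loc = min(values)
    return {
        "distribution": "uniform",
        # Assumption: loc is the lower bound and scale is the interval width.
        "learned_parameters": {"loc": loc, "scale": max(values) - loc},
    }


main = fit_uniform([25.0, 25.5, 26.5])
left = fit_uniform([])
```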

FAQs

When should I use this transformer?

This transformer is designed for numerical columns that contain outliers. The outliers may be on the left, right or both sides.

Will I see outliers again when I reverse transform the data?

Yes! If the initial data had outliers, the transformer will recreate outliers when reverse transforming the data. The outlier values it generates are estimates based on the learned parameters.
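
As a hedged illustration of what "estimates based on the learned parameters" could mean, the sketch below draws new right-outlier values from a learned uniform distribution. It assumes that 'loc' is the lower bound and 'scale' the width of the interval; the helper name `sample_uniform` is hypothetical, and the transformer's real sampling procedure is not described on this page.

```python
import random


def sample_uniform(learned, count, seed=None):
    """Draw `count` values from a learned uniform distribution (illustrative only)."""
    rng = random.Random(seed)
    params = learned["learned_parameters"]
    low = params["loc"]
    high = low + params["scale"]  # assumption: interval is [loc, loc + scale]
    return [rng.uniform(low, high) for _ in range(count)]


right = {"distribution": "uniform", "learned_parameters": {"scale": 1.2, "loc": 40.0}}
samples = sample_uniform(right, count=3, seed=0)
```

The generated values fall inside the learned range but will not match the original outliers one for one.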

iqr: A float that represents the interquartile range (Q3 - Q1) of the column.

*SDV Enterprise Feature. This feature is available to our licensed users and is not currently in our public library. For more information, visit our page to Explore SDV.