UniformEncoder


Last updated 19 days ago

Compatibility: categorical or boolean data

The UniformEncoder transforms data that represents categorical values into a uniform distribution in the [0,1] interval. It is highly accurate at preserving the overall frequencies of each category.

from rdt.transformers.categorical import UniformEncoder

transformer = UniformEncoder()

Parameters

order_by: Apply a prescribed ordering scheme. Use this if the discrete categorical values have an order.

(default) None: Do not apply a particular order.

'numerical_value': If the data is represented by integers or floats, order by those values.

'alphabetical': If the data is represented by strings, order them alphabetically.
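As a rough illustration of what the ordering schemes mean, the sketch below sorts category values in plain Python. The helper `order_categories` is hypothetical and not part of RDT, and keeping the order of first appearance when `order_by` is `None` is an assumption made only for this example.

```python
def order_categories(values, order_by=None):
    """Return the distinct categories in the order implied by order_by.

    Hypothetical helper for illustration; not RDT's implementation.
    """
    # dict.fromkeys keeps the first occurrence of each value, in order
    distinct = list(dict.fromkeys(values))
    if order_by is None:
        return distinct  # assumption: no particular order is imposed
    if order_by == 'numerical_value':
        if not all(isinstance(v, (int, float)) for v in distinct):
            raise ValueError("'numerical_value' requires integers or floats")
        return sorted(distinct)
    if order_by == 'alphabetical':
        if not all(isinstance(v, str) for v in distinct):
            raise ValueError("'alphabetical' requires strings")
        return sorted(distinct)
    raise ValueError(f'Unknown order_by: {order_by!r}')


responses = ['response_02', 'response_00', 'response_01', 'response_00']
ratings = [3, 1, 2, 1]

alphabetical = order_categories(responses, order_by='alphabetical')
numerical = order_categories(ratings, order_by='numerical_value')
unordered = order_categories(responses)
```

Here `alphabetical` and `numerical` come back sorted, while `unordered` simply reflects the order in which the values were first seen.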

Examples

from rdt.transformers.categorical import UniformEncoder

transformer = UniformEncoder(
    order_by='alphabetical'
)

The transformer assigns each category to a unique, non-overlapping subinterval of [0,1]. The length of each subinterval equals the category's frequency. For example, if the category 'CASH' occurs with 60% frequency, its subinterval has length 0.6, such as [0.2, 0.8].
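The interval assignment can be sketched in a few lines of standard-library Python. This is a minimal illustration of the idea, not RDT's actual code: the function names `fit_intervals` and `transform` are hypothetical, and laying out categories in order of first appearance is an assumption.

```python
import random
from collections import Counter


def fit_intervals(values):
    """Map each category to a [start, end] slice of [0, 1] sized by its frequency."""
    counts = Counter(values)  # keeps categories in order of first appearance
    total = len(values)
    intervals = {}
    seen = 0
    for category, count in counts.items():
        # Cumulative counts keep the slices adjacent and non-overlapping
        intervals[category] = [seen / total, (seen + count) / total]
        seen += count
    return intervals


def transform(values, intervals, rng):
    """Replace each category with a random float drawn from its slice."""
    return [rng.uniform(*intervals[value]) for value in values]


payments = ['CREDIT'] * 2 + ['CASH'] * 6 + ['DEBIT'] * 2
intervals = fit_intervals(payments)
encoded = transform(payments, intervals, random.Random(0))
```

Because each category's slice is as wide as its frequency and values are drawn uniformly within the slice, the encoded column as a whole is spread roughly uniformly over [0,1].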

Attributes

After fitting the transformer, you can access the learned values through the attributes.

frequencies: A dictionary that maps each category value to the observed frequency, as a float between 0 and 1

>>> transformer.frequencies
{
  'CREDIT': 0.2, 
  'CASH': 0.6,
  'DEBIT': 0.2
}

intervals: A dictionary that maps each category value to an interval within [0,1]. This allows you to determine the exact rules used for transforming and reverse transforming.

>>> transformer.intervals
{
  'CREDIT': [0, 0.2],
  'CASH': [0.2, 0.8],
  'DEBIT': [0.8, 1.0]
}
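To show how an intervals dictionary like this one could drive the reverse transform, here is a minimal sketch that looks up which interval contains each float. The function `reverse_transform` below is a hypothetical stand-alone helper, and the handling of values that fall exactly on a boundary is an assumption rather than documented RDT behavior.

```python
from bisect import bisect_right

intervals = {
    'CREDIT': [0, 0.2],
    'CASH': [0.2, 0.8],
    'DEBIT': [0.8, 1.0],
}


def reverse_transform(encoded, intervals):
    """Map floats in [0, 1] back to the category whose interval contains them."""
    # Sort categories by interval start so a binary search can locate the slice
    ordered = sorted(intervals.items(), key=lambda item: item[1][0])
    starts = [bounds[0] for _, bounds in ordered]
    categories = [category for category, _ in ordered]
    decoded = []
    for value in encoded:
        # Assumption: a value on a shared boundary belongs to the upper interval;
        # anything below the first start is clamped to the first category
        position = max(bisect_right(starts, value) - 1, 0)
        decoded.append(categories[position])
    return decoded


decoded = reverse_transform([0.05, 0.5, 0.79, 0.95], intervals)
```

With the intervals above, the four sample values decode to 'CREDIT', 'CASH', 'CASH' and 'DEBIT' respectively.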

FAQs

When should I use this transformer?

The UniformEncoder is shown to preserve the frequency of each category value with high accuracy. This is especially useful if you have a data imbalance, for example if True occurs only 1% of the time while False occurs 99% of the time.

When should I use the order_by parameter?

Use this parameter when the categorical data is ordinal (has a specific order) and the order can easily be discovered through sorting. For example, you might store survey responses as 'response_00', 'response_01', 'response_02', etc.

Don't add this parameter if it isn't necessary. Ordering increases the time it takes for transformation.

What if I'd like to sort the values by a custom order?

In some cases, your categories may not have an alphanumeric ordering scheme. Use the OrderedUniformEncoder to add your own custom sorting order.

What happens to missing values?

This transformer treats missing values as if they are a new category of data.
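Treating missing values as their own category can be pictured with a small frequency count. The sketch below uses `None` as a stand-in for a missing value, which is an assumption for illustration; it does not show how RDT represents missing data internally.

```python
from collections import Counter

# None stands in for a missing entry in this sketch
values = ['CASH', None, 'CASH', 'CREDIT', None, 'CASH', 'CASH', 'CASH']

# None is hashable, so Counter counts it like any other category
counts = Counter(values)
total = len(values)
frequencies = {category: count / total for category, count in counts.items()}
```

In this example the missing entries end up with a frequency of 0.25, so they would receive their own slice of [0,1] just like 'CASH' and 'CREDIT'.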