❖ DPDiscreteECDFNormalizer

Last updated 16 days ago

Compatibility: categorical data

The DPDiscreteECDFNormalizer uses differential privacy techniques to normalize your categorical values into a numerical column that is uniform or normal. To do this, it estimates the empirical distribution and adds differentially private noise to your data. (On the reverse transform, this transformer brings the data back into the original category values.)

from rdt.transformers.categorical import DPDiscreteECDFNormalizer

transformer = DPDiscreteECDFNormalizer(
    epsilon=0.5,
    normalized_distribution='uniform'
)

Parameters

(required) epsilon: A float >0 that represents the privacy loss budget you are willing to accommodate.

order_by: Apply a prescribed ordering scheme. Use this if the discrete categorical values have an order.

  • (default) None: Do not apply a particular order.

  • 'numerical_value': If the data is represented by integers or floats, order by those values.

  • 'alphabetical': If the data is represented by strings, order them alphabetically.
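To illustrate what the three order_by options mean, here is a minimal sketch of an ordering helper. The function name `order_categories` is hypothetical, and treating None as "keep first-seen order" is an assumption; the transformer's internal behavior may differ.

```python
def order_categories(values, order_by=None):
    """Return the distinct categories in the order used for encoding."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    distinct = list(dict.fromkeys(values))
    if order_by is None:
        # Assumption: "no particular order" is modeled as first-seen order.
        return distinct
    if order_by == 'numerical_value':
        # Expects ints or floats.
        return sorted(distinct)
    if order_by == 'alphabetical':
        return sorted(distinct, key=str)
    raise ValueError(f"Unknown order_by: {order_by!r}")
```

The order matters because neighboring categories end up next to each other on the numerical scale after normalization.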

normalized_distribution: Add this argument to control the shape of the transformed data. Choose whatever is easiest for your downstream use case.

  • (default) 'uniform': Transform the data into a uniform distribution, between 0 and 1.

  • 'norm': Transform the data into a standard normal distribution, that is, a bell curve with a mean of 0 and a standard deviation of 1.
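The two shapes are closely related: a value that is uniform on (0, 1) can be mapped to a standard normal value with the inverse normal CDF, and back again with the normal CDF. The sketch below shows that relationship using only the Python standard library. The helper names are hypothetical, and whether the transformer performs this exact step internally is an assumption.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def uniform_to_norm(u, margin=1e-12):
    """Map a value in [0, 1] to the standard normal scale."""
    # The inverse CDF is undefined at exactly 0 and 1, so clamp slightly inward.
    u = min(max(u, margin), 1.0 - margin)
    return STANDARD_NORMAL.inv_cdf(u)


def norm_to_uniform(z):
    """Map a standard normal value back to the (0, 1) scale."""
    return STANDARD_NORMAL.cdf(z)
```

For example, the uniform value 0.5 sits at the center of the bell curve, and values close to 0 or 1 land in its tails.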

FAQ

Can I share the data after applying this? What are the differential privacy guarantees?

Differential privacy controls the amount of influence a single data point can have over the final, transformed column. After applying the transformer to a column, the entire column provides differential privacy guarantees, so you should be able to share it as well as any statistics about it (mode, frequencies, etc.).

Please note that this transformer only applies differential privacy to the individual column. It does not provide differential privacy guarantees if you'd like to share multiple columns at a time. For that, we recommend using a differentially private synthesizer that can handle many columns at once.

Which algorithms does this transformer use?

This transformer creates a bar chart of your data and uses it to compute an empirical CDF distribution. The empirical CDF distribution can be used to normalize your data into a different shape (uniform or normal) using the probability integral transform.

Throughout the process, the transformer uses ε-differentially private mechanisms to add controlled noise to the frequencies of each category value. For more information about this, see the Laplace mechanism.

How is the privacy loss budget (ε) used?

The privacy loss budget is used when saving the frequencies of each category value. This step uses the Laplace mechanism.
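The steps described in this FAQ (noisy category frequencies, an empirical CDF, and the probability integral transform) can be sketched in plain Python. This is an illustration under stated assumptions, not the transformer's actual implementation: the function names are hypothetical, the noise scale assumes a sensitivity of 1 (one row changes one count by 1), and the sketch ignores hardening that a production differential privacy library would need.

```python
import random
from bisect import bisect_right
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws follows Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def fit_noisy_cdf(values, epsilon, rng):
    """Return (categories, upper_edges) built from Laplace-noised counts."""
    counts = Counter(values)
    categories = sorted(counts)  # assumption: sorted order, for a stable layout
    # Assumption: sensitivity 1, so the Laplace scale is 1 / epsilon.
    # Noisy counts are floored at a tiny positive value to keep intervals valid.
    noisy = [max(counts[c] + laplace_noise(1 / epsilon, rng), 1e-9)
             for c in categories]
    total = sum(noisy)
    edges, running = [], 0.0
    for n in noisy:
        running += n / total
        edges.append(running)
    edges[-1] = 1.0  # guard against floating-point drift
    return categories, edges


def transform(values, categories, edges, rng):
    """Map each category to a uniform draw inside its CDF interval."""
    lower = dict(zip(categories, [0.0] + edges[:-1]))
    upper = dict(zip(categories, edges))
    return [rng.uniform(lower[v], upper[v]) for v in values]


def reverse_transform(numbers, categories, edges):
    """Map a number in [0, 1] back to the category whose interval holds it."""
    last = len(categories) - 1
    return [categories[min(bisect_right(edges, x), last)] for x in numbers]


rng = random.Random(0)
data = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
categories, edges = fit_noisy_cdf(data, epsilon=0.5, rng=rng)
numbers = transform(data, categories, edges, rng)
restored = reverse_transform(numbers, categories, edges)
```

Each category owns a slice of the [0, 1] range whose width is its noisy share of the data, which is why the transformed column looks uniform and why the reverse transform can recover the original category values.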

❖ SDV Enterprise Bundle. This feature is available as part of the Differential Privacy Bundle, an optional add-on to SDV Enterprise. For more information, please visit the Differential Privacy Bundle page. Coming soon!

How should I choose my privacy loss budget (epsilon)?

The value of epsilon is a measure of how much risk you're willing to take on when it comes to privacy.

  • Values in the 0-1 range indicate that you are not willing to take on too much risk. As a result, the synthetic data will have strong privacy guarantees, potentially at the expense of data quality.

  • Values in the 2-10 range indicate that you're willing to accept some privacy risk in order to preserve more data quality.
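To get a feel for this trade-off, recall that the Laplace mechanism draws noise with scale sensitivity / ε, so a smaller epsilon means larger noise on each category frequency. The sketch below prints the noise scale and standard deviation for a few epsilon values. The helper names are hypothetical, and the sensitivity of 1 is an assumption about how the counts are protected.

```python
import math


def laplace_scale(epsilon, sensitivity=1.0):
    """Scale parameter b of the Laplace noise for a given privacy budget."""
    if epsilon <= 0:
        raise ValueError("epsilon must be > 0")
    return sensitivity / epsilon


def laplace_std(epsilon, sensitivity=1.0):
    """Standard deviation of Laplace(0, b), which equals sqrt(2) * b."""
    return math.sqrt(2) * laplace_scale(epsilon, sensitivity)


for eps in (0.1, 0.5, 1.0, 5.0, 10.0):
    print(f"epsilon={eps}: scale={laplace_scale(eps):.2f}, "
          f"std={laplace_std(eps):.2f}")
```

Whether a given amount of noise is acceptable depends on the size of your category counts: noise with a scale of 10 barely changes a count of 100,000 but can swamp a count of 5.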