
❖ DPDiscreteECDFNormalizer


Last updated 13 days ago

Compatibility: categorical data

The DPDiscreteECDFNormalizer uses differential privacy techniques to normalize your categorical values into a numerical column that follows a uniform or normal distribution. To do this, it estimates the empirical distribution and adds differentially private noise to your data. (On the reverse transform, this transformer brings the data back into the original category values.)

from rdt.transformers.categorical import DPDiscreteECDFNormalizer

transformer = DPDiscreteECDFNormalizer(
    epsilon=0.5,
    normalized_distribution='uniform'
)

Parameters

(required) epsilon: A float >0 that represents the privacy loss budget you are willing to accommodate.

order_by: Apply a prescribed ordering scheme. Use this if the discrete categorical values have an order.

(default) None

Do not apply a particular order.

'numerical_value'

If the data is represented by integers or floats, order by those values.

'alphabetical'

If the data is represented by strings, order them alphabetically.
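To make the ordering options concrete, here is a minimal sketch of how the three settings could arrange the category values before they are normalized. The `order_categories` helper is hypothetical and not part of RDT. Treating `None` as "keep the order of first appearance" is an assumption made for illustration only.

```python
from typing import Hashable, List, Optional, Sequence


def order_categories(values: Sequence[Hashable],
                     order_by: Optional[str] = None) -> List[Hashable]:
    """Return the unique categories in the order implied by `order_by`.

    Hypothetical helper for illustration; not an RDT function.
    """
    # dict.fromkeys keeps the first occurrence of each value, in order.
    unique = list(dict.fromkeys(values))
    if order_by is None:
        # Assumption: no prescribed order means first-appearance order.
        return unique
    if order_by == 'numerical_value':
        return sorted(unique, key=float)
    if order_by == 'alphabetical':
        return sorted(unique, key=str)
    raise ValueError(f'Unknown order_by value: {order_by!r}')


sizes = [10, 2, 33, 2, 10]
grades = ['medium', 'low', 'high', 'low']

print(order_categories(sizes, 'numerical_value'))  # [2, 10, 33]
print(order_categories(grades, 'alphabetical'))    # ['high', 'low', 'medium']
```

Note that alphabetical order is not always the meaningful order: in the second example, 'high' sorts before 'low'.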

normalized_distribution: Add this argument to control the shape of the transformed data. Choose whatever is easiest for your downstream use case.

(default) 'uniform'

Transform the data into a uniform distribution, between 0 and 1.

'norm'

Transform the data into a standard normal distribution, aka a bell curve with mean of 0 and standard deviation of 1.
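The two shapes are closely related: a value that is uniform between 0 and 1 can be mapped onto a standard normal curve with the inverse normal CDF, and mapped back with the normal CDF. The sketch below shows that relationship using only the Python standard library. The function names are hypothetical, and this is not necessarily how the transformer performs the conversion internally.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def uniform_to_norm(u: float) -> float:
    """Map a value in the open interval (0, 1) onto the standard normal."""
    return STANDARD_NORMAL.inv_cdf(u)


def norm_to_uniform(z: float) -> float:
    """Map a standard normal value back into (0, 1)."""
    return STANDARD_NORMAL.cdf(z)


for u in (0.1, 0.5, 0.9):
    z = uniform_to_norm(u)
    print(f'uniform {u} -> normal {z:.3f} -> uniform {norm_to_uniform(z):.3f}')
```

The middle of the uniform range (0.5) lands on the mean of the bell curve (0), and values near 0 or 1 land in the tails.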

FAQ

Which algorithms does this transformer use?

This transformer creates a bar chart of your data and uses it to compute an empirical CDF (cumulative distribution function). The empirical CDF can be used to normalize your data into a different shape (uniform or normal) using the probability integral transform.

Throughout the process, the transformer uses ε-differentially private mechanisms to add controlled noise to the frequencies of each category value. For more information about this, see the Laplace mechanism.
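The empirical CDF and probability integral transform steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the transformer's actual implementation: each category is given a slice of the interval from 0 to 1 that is proportional to its frequency, values are drawn uniformly within their slice, and the reverse step looks up which slice a number falls into. All function names are hypothetical. To keep the focus on this step, the example uses exact frequencies; in the differentially private setting, the frequencies passed in would be the noisy ones.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple

Intervals = Dict[Hashable, Tuple[float, float]]


def build_intervals(frequencies: Dict[Hashable, float]) -> Intervals:
    """Give each category a slice of [0, 1] proportional to its frequency."""
    total = sum(frequencies.values())
    intervals: Intervals = {}
    start = 0.0
    for category, frequency in frequencies.items():
        end = start + frequency / total  # running sum = empirical CDF
        intervals[category] = (start, end)
        start = end
    return intervals


def to_uniform(values: Sequence[Hashable], intervals: Intervals,
               rng: random.Random) -> List[float]:
    """Replace each category with a random point inside its slice."""
    result = []
    for value in values:
        start, end = intervals[value]
        result.append(start + (end - start) * rng.random())
    return result


def to_categories(numbers: Sequence[float],
                  intervals: Intervals) -> List[Hashable]:
    """Reverse step: find the slice that contains each number."""
    categories = list(intervals)
    result = []
    for number in numbers:
        match = categories[-1]  # fallback for numbers at the upper edge
        for category in categories:
            if number < intervals[category][1]:
                match = category
                break
        result.append(match)
    return result


data = ['red', 'blue', 'red', 'green', 'red', 'blue']
intervals = build_intervals(dict(Counter(data)))
uniform_values = to_uniform(data, intervals, random.Random(0))

print(intervals)
print(to_categories(uniform_values, intervals) == data)
```

Because frequent categories receive wider slices, the transformed values spread evenly across the interval, which is what makes the output approximately uniform.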

How is the privacy loss budget (ε) used?

The privacy loss budget is used when saving the frequencies of each category value. This step uses the Laplace mechanism.
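As a rough illustration of the Laplace mechanism applied to category frequencies, the sketch below adds Laplace noise with scale `sensitivity / epsilon` to each category count and then renormalizes. The helper names are hypothetical. The sensitivity of 1 (one row changes one count by at most 1) and the clamping of negative counts to zero are assumptions made for this example, and the transformer's own procedure may differ.

```python
import random
from collections import Counter
from typing import Dict, Hashable, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace distribution centered at 0.

    The difference of two independent exponential draws with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_frequencies(values: Sequence[Hashable], epsilon: float,
                      rng: random.Random,
                      sensitivity: float = 1.0) -> Dict[Hashable, float]:
    """Return category frequencies computed from Laplace-noised counts."""
    if epsilon <= 0:
        raise ValueError('epsilon must be a float > 0')
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    counts = Counter(values)
    # Assumption: clamp negative noisy counts to zero before normalizing.
    noisy = {category: max(count + laplace_noise(scale, rng), 0.0)
             for category, count in counts.items()}
    total = sum(noisy.values())
    if total == 0:
        # Every count was clamped to zero: fall back to equal frequencies.
        return {category: 1.0 / len(noisy) for category in noisy}
    return {category: count / total for category, count in noisy.items()}


data = ['red'] * 50 + ['blue'] * 30 + ['green'] * 20
print(noisy_frequencies(data, epsilon=0.5, rng=random.Random(0)))
```

Clamping and renormalizing only post-process the noisy counts, so they do not consume any additional privacy budget.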

Can I share the data after applying this? What are the differential privacy guarantees?

Differential privacy controls the amount of influence a single data point can have over the final, transformed column. After applying the transformer to a column, the entire column provides differential privacy guarantees, so you should be able to share it as well as any statistics about it (mode, frequencies, etc.).

Please note that this transformer only applies differential privacy to the individual column. It does not provide differential privacy guarantees if you'd like to share multiple columns at a time. For that, we recommend using a differentially private synthesizer that can handle many columns at once.

❖ SDV Enterprise Bundle. This feature is available as part of the Differential Privacy Bundle, an optional add-on to SDV Enterprise. For more information, please visit the Differential Privacy Bundle page. Coming soon!

How should I choose my privacy loss budget (epsilon)?

The value of epsilon is a measure of how much risk you're willing to take on when it comes to privacy.

  • Values in the 0-1 range indicate that you are not willing to take on too much risk. As a result, the synthetic data will have strong privacy guarantees — potentially at the expense of data quality.

  • Values in the 2-10 range indicate that you're willing to accept some privacy risk in order to preserve more data quality.
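To get a feel for this trade-off, it can help to look at how much noise a given epsilon implies. The sketch below assumes Laplace noise with a sensitivity of 1, in which case the noise scale is `1 / epsilon` and the expected absolute error added to each category count equals that scale. The function name is hypothetical, and the numbers illustrate the general mechanism rather than the exact error of this transformer.

```python
def expected_abs_noise(epsilon: float, sensitivity: float = 1.0) -> float:
    """Expected absolute Laplace noise for one count.

    For a Laplace distribution with scale b, the mean absolute value is b.
    Here b = sensitivity / epsilon (sensitivity of 1 is an assumption).
    """
    if epsilon <= 0:
        raise ValueError('epsilon must be a float > 0')
    return sensitivity / epsilon


# Prints 2.0, 1.0, 0.5 and 0.1 for the four budgets below.
for epsilon in (0.5, 1.0, 2.0, 10.0):
    noise = expected_abs_noise(epsilon)
    print(f'epsilon={epsilon}: expected noise per count = {noise}')
```

Under these assumptions, the noise is fixed by epsilon and does not grow with the data, so a column with thousands of rows per category is barely affected, while rare categories with only a few rows can be distorted noticeably at low epsilon values.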