❖ DPECDFNormalizer


Last updated 16 days ago

Compatibility: numerical data

The DPECDFNormalizer uses differential privacy techniques to normalize your data into a uniform or normal shape. To do this, it estimates the empirical distribution and adds differentially private noise to your data. (On the reverse transform, this transformer brings the data back into its original shape.)

from rdt.transformers.numerical import DPECDFNormalizer

transformer = DPECDFNormalizer(
    epsilon=3.5,
    known_min_value=0,
    known_max_value=100,
    normalized_distribution='uniform'
)
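To illustrate what the transformer computes, here is a minimal, non-private sketch of ECDF normalization via the probability integral transform, using numpy and scipy. The names and logic are illustrative assumptions, not the actual DPECDFNormalizer implementation, which additionally adds differentially private noise at each step:

```python
import numpy as np
from scipy.stats import norm

def ecdf_normalize(data, n_bins=25, distribution='uniform'):
    """Illustrative (non-private) ECDF normalization."""
    # Build a histogram-based empirical CDF of the data
    counts, edges = np.histogram(data, bins=n_bins)
    cdf = np.cumsum(counts) / counts.sum()
    # Map each value to the cumulative probability of its bin
    bin_idx = np.clip(np.searchsorted(edges, data, side='right') - 1, 0, n_bins - 1)
    u = cdf[bin_idx]  # values now lie in (0, 1]
    if distribution == 'norm':
        # Inverse normal CDF turns uniform values into a bell curve
        return norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))
    return u

data = np.random.default_rng(0).exponential(size=1000)
u = ecdf_normalize(data)
```

The reverse transform would invert this mapping, using the stored bin edges to send each cumulative probability back to a value in the original range.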

Parameters

(required) epsilon: A float >0 that represents the privacy loss budget you are willing to accommodate.

known_min_value: A previously-known min value that the data must take. Providing this value will help to conserve the privacy budget and ultimately yield higher fidelity data for the same epsilon value.

The min value should represent prior knowledge of the data. In order to enforce differential privacy, it is critical that the min value is prior knowledge that is not based on any computations of the real data.

(default) None

There is no known minimum value for the data. The transformer will compute one based on the fit data, using some of the privacy budget.

<float>

The transformer will make sure the data is never less than this value. This will not use up any privacy budget.

known_max_value: A previously-known max value that the data must take. Providing this value will help to conserve the privacy budget and ultimately yield higher fidelity data for the same epsilon value.

The max value should represent prior knowledge of the data. In order to enforce differential privacy, it is critical that the max value is prior knowledge that is not based on any computations of the real data.

(default) None

There is no known maximum value for the data. The transformer will compute one based on the fit data, using some of the privacy budget.

<float>

The transformer will make sure the data is never greater than this value. This will not use up any privacy budget.

n_bins: This parameter controls the number of bins to divide the data into when computing the empirical distribution. You can think of these as the number of bars a histogram of the data would have.

(default) 25

Divide up the data into 25 bins when computing the empirical distribution

<int>

Divide up the data into the provided number of bins

missing_value_encoding: Add this argument to control how to encode missing values in the empirical distribution. Missing values can be binned together and represented as being in either the highest or lowest bin of the histogram.

(default) 'ecdf_low_bin'

Encode the missing values in the empirical CDF as the first (lowest) bin

'ecdf_high_bin'

Encode the missing values in the empirical CDF as the final (highest) bin

normalized_distribution: Add this argument to control the shape of the transformed data. Choose whatever is easiest for your downstream use case.

(default) 'uniform'

Transform the data into a uniform distribution, between 0 and 1.

'norm'

Transform the data into a standard normal distribution, aka a bell curve with mean of 0 and standard deviation of 1.

learn_rounding_scheme: Add this argument to allow the transformer to learn about rounded values in your dataset.

(default) False

Do not learn or enforce any rounding scheme. When reverse transforming the data, there may be many decimal places present.

True

Learn the rounding rules from the input data. When reverse transforming the data, round to the same number of decimal digits as the original.
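One plausible way to learn such a rounding scheme is to find the smallest number of decimal digits that reproduces every fitted value exactly. The helper below is a hypothetical sketch of that idea, not the library's actual implementation:

```python
def infer_decimal_places(values, max_digits=10):
    """Return the smallest number of decimal digits that round-trips
    every value exactly, or None if no such count is found."""
    for digits in range(max_digits + 1):
        if all(round(v, digits) == v for v in values):
            return digits
    return None

# Values with at most 2 decimal places are detected as such
scheme = infer_decimal_places([1.25, 3.5, 2.0])
```

On reverse transform, the learned digit count would then be applied with an ordinary round.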

FAQ

Which algorithms does this transformer use?

This transformer creates a histogram of your data and uses it to compute an empirical CDF distribution. The empirical CDF distribution can be used to normalize your data into a different shape (uniform or normal) using the probability integral transform.

Throughout the process, the transformer uses ε-differentially private mechanisms to add controlled noise to everything learned about the data, including the min/max boundaries and the frequencies of each histogram bin. For more information about this, see the Laplace mechanism.

How is the privacy loss budget (ε) used?

The privacy loss budget is used during 2 possible phases of the transformation:

  • Computing differentially private min/max values from the data, so as not to reveal the actual min or max value that the data contains. This step is required if the known_min_value and known_max_value are not provided.

  • Noising the empirical distribution (histogram frequencies) using the Laplace mechanism. This step is always performed.

How should I choose my privacy loss budget (epsilon)?

The value of epsilon is a measure of how much risk you're willing to take on when it comes to privacy.

  • Values in the 0-1 range indicate that you are not willing to take on too much risk. As a result, the synthetic data will have strong privacy guarantees, potentially at the expense of data quality.

  • Values in the 2-10 range indicate that you're willing to accept some privacy risk in order to preserve more data quality.

Can I share the data after applying this? What are the differential privacy guarantees?

Differential privacy controls the amount of influence a single data point can have over the final, transformed column. After applying the transformer to a column, the entire column provides differential privacy guarantees, so you should be able to share it as well as any statistics about it (min, max, mean, etc.).

Please note that this transformer only applies differential privacy to the individual column. It does not provide differential privacy guarantees if you'd like to share multiple columns at a time. For that, we recommend using a differentially private synthesizer that can handle many columns at once.
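The histogram-noising phase of the budget can be sketched with numpy. This is an illustrative sketch of the Laplace mechanism, not the product's implementation; the sensitivity of 1 assumes each individual contributes exactly one row, so adding or removing a person changes one bin count by at most 1:

```python
import numpy as np

def laplace_noised_histogram(data, n_bins=25, epsilon=3.5, rng=None):
    """Add Lap(sensitivity / epsilon) noise to each histogram count,
    then renormalize into frequencies."""
    rng = np.random.default_rng(rng)
    counts, edges = np.histogram(data, bins=n_bins)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=n_bins)
    noisy = np.clip(counts + noise, 0, None)  # counts cannot be negative
    return noisy / noisy.sum(), edges

freqs, edges = laplace_noised_histogram(np.arange(100.0), epsilon=3.5, rng=0)
```

Note the scale of the noise is inversely proportional to epsilon: a smaller budget means noisier bin frequencies and therefore lower-fidelity transformed data.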

❖ SDV Enterprise Bundle. This feature is available as part of the Differential Privacy Bundle, an optional add-on to SDV Enterprise. For more information, please visit the Differential Privacy Bundle page. Coming soon!