❖ DPDiscreteECDFNormalizer

Compatibility: categorical data

❖ SDV Enterprise Bundle. This feature is available as part of the Differential Privacy Bundle, an optional add-on to SDV Enterprise. For more information, please visit the Differential Privacy Bundle page. Coming soon!

The DPDiscreteECDFNormalizer uses differential privacy techniques to normalize your categorical values into a numerical column that is uniform or normal. To do this, it estimates the empirical distribution of your category values and adds differentially private noise to the estimated frequencies. (On the reverse transform, this transformer brings the data back into the original category values.)

from rdt.transformers.categorical import DPDiscreteECDFNormalizer

transformer = DPDiscreteECDFNormalizer(
    epsilon=0.5,
    normalized_distribution='uniform'
)

Parameters

(required) epsilon: A float >0 that represents the privacy loss budget you are willing to accommodate.

How should I choose my privacy loss budget (epsilon)? The value of epsilon is a measure of how much risk you're willing to take on when it comes to privacy.

Values in the 0-1 range indicate that you are not willing to take on too much risk. As a result, the synthetic data will have strong privacy guarantees — potentially at the expense of data quality.
Values in the 2-10 range indicate that you're willing to accept some privacy risk in order to preserve more data quality.
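
To build intuition for these ranges, the sketch below shows how the amount of Laplace noise shrinks as epsilon grows. It assumes the standard Laplace mechanism with a sensitivity of 1 (adding or removing one row changes one category count by at most 1). The helper name laplace_noise_scale is hypothetical and is not part of the RDT API, and the transformer's actual internals may differ.

```python
import math


def laplace_noise_scale(epsilon, sensitivity=1.0):
    """Return the Laplace scale b = sensitivity / epsilon.

    Assumption: a count query with sensitivity 1. This is an
    illustrative helper, not a function from RDT.
    """
    if epsilon <= 0:
        raise ValueError('epsilon must be greater than 0')
    return sensitivity / epsilon


for epsilon in (0.5, 1.0, 5.0, 10.0):
    scale = laplace_noise_scale(epsilon)
    # The standard deviation of Laplace(0, b) noise is sqrt(2) * b.
    std = math.sqrt(2) * scale
    print(f'epsilon={epsilon}: scale={scale:.2f}, noise std={std:.2f}')
```

Under these assumptions, halving epsilon doubles the typical noise added to each category count, which is why small epsilon values can cost data quality.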

order_by: Apply a prescribed ordering scheme. Use this if the discrete categorical values have an order.

(default) None

Do not apply a particular order

'numerical_value'

If the data is represented by integers or floats, order by those values

'alphabetical'

If the data is represented by strings, order them alphabetically.
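
As a rough illustration of what each order_by option means, the sketch below orders the unique values of a column in plain Python. The function order_categories is hypothetical, and keeping first-appearance order for None is an assumption; the source only says that no particular order is applied.

```python
def order_categories(values, order_by=None):
    """Return the unique categories in the order implied by order_by.

    Illustrative only; this is not the transformer's implementation.
    """
    # dict.fromkeys keeps the first appearance of each value.
    unique = list(dict.fromkeys(values))
    if order_by is None:
        # Assumption: leave categories in the order they first appear.
        return unique
    if order_by == 'numerical_value':
        # Integers or floats: sort by their numeric value.
        return sorted(unique)
    if order_by == 'alphabetical':
        # Strings: sort alphabetically.
        return sorted(unique, key=str)
    raise ValueError(f'Unknown order_by: {order_by!r}')


print(order_categories([3, 1, 2, 1], order_by='numerical_value'))
print(order_categories(['low', 'high', 'medium'], order_by='alphabetical'))
```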

normalized_distribution: Add this argument to control the shape of the transformed data. Choose whatever is easiest for your downstream use case.

(default) 'uniform'

Transform the data into a uniform distribution, between 0 and 1.

'norm'

Transform the data into a standard normal distribution, that is, a bell curve with a mean of 0 and a standard deviation of 1.
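
The two shapes are closely related: a value that is uniform between 0 and 1 can be mapped to a standard normal value with the inverse normal CDF, and mapped back with the normal CDF. The sketch below shows this relationship using only the Python standard library. The function names are hypothetical, and this is not necessarily how the transformer performs the conversion.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def uniform_to_normal(u):
    """Map a value strictly between 0 and 1 to a standard normal value."""
    return STANDARD_NORMAL.inv_cdf(u)


def normal_to_uniform(z):
    """Map a standard normal value back to the 0-1 range."""
    return STANDARD_NORMAL.cdf(z)


for u in (0.1, 0.5, 0.9):
    z = uniform_to_normal(u)
    print(f'uniform={u} -> normal={z:.3f} -> uniform={normal_to_uniform(z):.3f}')
```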

FAQ

Which algorithms does this transformer use?

This transformer counts how often each category value appears (essentially a bar chart of your data) and uses those frequencies to compute an empirical CDF. The empirical CDF is then used to normalize your data into a different shape (uniform or normal) using the probability integral transform.

Throughout the process, the transformer uses ε-differentially private mechanisms to add controlled noise to the frequencies of each category value. For more information about this, see the Laplace mechanism.
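
The overall idea can be sketched in a few lines of plain Python: count each category, add Laplace noise to the counts, turn the noisy counts into cumulative intervals between 0 and 1, and map each category to a random point inside its interval. This is a minimal sketch under stated assumptions, not the transformer's actual implementation: the function names are hypothetical, the sensitivity is assumed to be 1, noisy counts are clipped to a small positive floor, and Python's random module is used for convenience even though it is not suitable for production-grade privacy guarantees.

```python
import bisect
import random
from collections import Counter


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def fit_noisy_cdf(values, epsilon, rng):
    """Return categories and the upper edge of each category's CDF interval."""
    counts = Counter(values)
    categories = list(counts)
    # Assumption: sensitivity 1, so the Laplace scale is 1 / epsilon.
    # Clip so that every category keeps a small positive weight.
    noisy = [max(counts[c] + laplace(1 / epsilon, rng), 1e-9) for c in categories]
    total = sum(noisy)
    edges, running = [], 0.0
    for weight in noisy:
        running += weight / total
        edges.append(running)
    edges[-1] = 1.0  # guard against floating-point drift
    return categories, edges


def transform(values, categories, edges, rng):
    """Map each category to a random point inside its CDF interval."""
    index = {c: i for i, c in enumerate(categories)}
    result = []
    for value in values:
        i = index[value]
        low = edges[i - 1] if i > 0 else 0.0
        high = edges[i]
        x = low + (high - low) * rng.random()
        if x >= high:  # rounding can land on the upper edge; keep x in [low, high)
            x = low
        result.append(x)
    return result


def reverse_transform(numbers, categories, edges):
    """Map each number back to the category whose interval contains it."""
    last = len(categories) - 1
    return [categories[min(bisect.bisect_right(edges, x), last)] for x in numbers]


rng = random.Random(0)
data = ['a', 'b', 'a', 'c', 'a', 'b']
categories, edges = fit_noisy_cdf(data, epsilon=0.5, rng=rng)
numbers = transform(data, categories, edges, rng)
print(reverse_transform(numbers, categories, edges) == data)
```

In this sketch the noise only changes how wide each category's interval is, so the reverse transform still recovers the original categories from the transformed values.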

How is the privacy loss budget (ε) used?

The privacy loss budget is used when saving the frequencies of each category value. This step uses the Laplace mechanism.

