Privacy Against Inference
Privacy Against Inference describes a set of metrics that calculate the risk of an attacker being able to infer real, sensitive values. We assume that an attacker already possesses a few columns of real data, which they combine with the synthetic data to make educated guesses about the sensitive values.
The attacker can use various algorithms to make the guesses. Each is covered by a different metric:
- Guessing numerical values: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
- Guessing categorical values: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble
Data Compatibility
- Categorical/Boolean: Some metrics can be used for discrete, categorical data: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble
- Numerical: Some metrics can be used for numerical data: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
Choose a metric depending on the type of data that the attacker is guessing. The key_fields and sensitive_fields must all be of the same type. Note that missing values are not supported; please remove or impute missing values before applying these metrics.
Score
- (best) 1.0: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the chosen attack algorithm.
- (worst) 0.0: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the chosen attack algorithm.
We assume that the attacker is in possession of:
- a few columns of the real data (key_fields), as well as
- the full synthetic dataset, including synthetic sensitive values
The attacker's goal is to correctly guess the real values of the sensitive information, sensitive_fields. An example is shown below.
In this example, we assume the key_fields are a person's age bracket and gender. Meanwhile, the sensitive_fields are the person's political affiliation; this is what the attacker wants to guess. To make the guesses, the attacker uses a machine learning algorithm suited to the type of data being guessed.
Access this metric from the single_table module and use the compute method.
```python
from sdmetrics.single_table import CategoricalKNN

CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation']
)
```
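For context, here is a minimal, self-contained sketch. The toy real_table and synthetic_table below are invented purely for illustration; in practice they would be your real dataset and the synthetic data produced by your generator.
```python
import pandas as pd
from sdmetrics.single_table import CategoricalKNN

# Toy data, made up for illustration only.
real_table = pd.DataFrame({
    'age_bracket': ['18-25', '26-35', '36-45', '26-35'],
    'gender': ['F', 'M', 'F', 'M'],
    'political_affiliation': ['A', 'B', 'A', 'B'],
})
synthetic_table = pd.DataFrame({
    'age_bracket': ['18-25', '36-45', '26-35', '18-25'],
    'gender': ['M', 'F', 'M', 'F'],
    'political_affiliation': ['B', 'A', 'B', 'A'],
})

# The attacker knows age_bracket and gender, and wants political_affiliation.
score = CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation'],
)
print(score)  # closer to 1.0 means the attack is less successful
```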
Parameters
- (required) real_data: A pandas.DataFrame containing the real data
- (required) synthetic_data: A pandas.DataFrame containing the same columns of synthetic data
- (required) key_fields: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.
- (required) sensitive_fields: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.
- **kwargs: Optional keyword args that allow you to customize the model. These args are directly passed into the scikit-learn algorithm.
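As a hedged sketch of the **kwargs behavior described above: it assumes CategoricalKNN's underlying model is scikit-learn's KNeighborsClassifier, so n_neighbors is used as an illustrative parameter. The valid parameter names depend on the underlying model and your version of the library.
```python
from sdmetrics.single_table import CategoricalKNN

# Per the **kwargs description above, extra keyword args are forwarded to the
# underlying scikit-learn model (assumed here: KNeighborsClassifier).
score = CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation'],
    n_neighbors=5,  # illustrative; accepted names depend on the underlying model
)
```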
Metrics
Numerical Data
Use these metrics if the key and sensitive fields are numerical, representing continuous data (see the sketch after these lists).
- NumericalMLP
- NumericalLR
- NumericalSVR
- NumericalRadiusNearestNeighbor
Categorical Data
Use these metrics if the key and sensitive fields are categorical, representing discrete data.
- CategoricalKNN
- CategoricalNB
- CategoricalRF
- CategoricalEnsemble
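As a companion to the categorical usage example above, here is a sketch for the numerical metrics. The column names ('age', 'years_at_job', 'salary') are hypothetical; any numerical key and sensitive columns in your tables would work the same way.
```python
from sdmetrics.single_table import NumericalLR

# Numerical variant: key_fields and sensitive_fields must all be numerical columns.
score = NumericalLR.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age', 'years_at_job'],  # hypothetical numerical columns
    sensitive_fields=['salary'],         # hypothetical numerical column
)
```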
These metrics are in Beta. Be careful when using them and interpreting their scores.
- The score depends heavily on the underlying algorithm used to model the data. If the dataset is not well suited to a particular machine learning method, the predicted values may not be valid.
- In a real-world scenario, an attacker may spend more effort building an ML model. These metrics only allow you to select from specific algorithms (LR, MLP, etc.).
The CategoricalCAP metric also measures privacy using a similar methodology. In this metric, the attacker uses an inference algorithm called Correct Attribution Prediction (CAP). We recommend using this metric, as the CAP algorithm has been well studied for the purposes of evaluating synthetic data. This algorithm is also closely related to the privacy concepts of k-anonymity and l-diversity.
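CategoricalCAP lives in the same single_table module. Assuming it follows the same compute interface as the metrics above, a sketch reusing the earlier columns looks like this:
```python
from sdmetrics.single_table import CategoricalCAP

# Same attacker setup as before: known key columns, guessed sensitive column.
score = CategoricalCAP.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation'],
)
```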