Privacy Against Inference
Privacy Against Inference describes a set of metrics that calculate the risk of an attacker being able to infer real, sensitive values. We assume that an attacker already possesses a few columns of real data, which they combine with the synthetic data to make educated guesses about the sensitive values.
The attacker can use various algorithms to make the guesses. Each is covered by a different metric:
- Guessing numerical values: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
- Guessing categorical values: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble
Data Compatibility
- Categorical/Boolean: Some metrics can be used for discrete, categorical data: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble
- Numerical: Some metrics can be used for numerical data: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
Choose a metric depending on the type of data that the attacker is guessing. The key_fields and sensitive_fields must all be of the same type. Note that missing values are not supported; please remove or impute missing values before applying these metrics.
Score
- (best) 1.0: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the chosen attack algorithm.
- (worst) 0.0: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the chosen attack algorithm.
We assume that the attacker is in possession of:
- a few columns of the real data (key_fields), as well as
- the full synthetic dataset, including synthetic sensitive values
The attacker's goal is to correctly guess the real values of the sensitive information, sensitive_fields. An example is shown below.
In this example, we assume the key_fields are a person's age bracket and gender. Meanwhile, the sensitive_fields are the person's political affiliation; this is what the attacker wants to guess. To make the guesses, the attacker uses a machine learning algorithm suited to the type of data being guessed.
Access this metric from the single_table module and use the compute method.
```python
from sdmetrics.single_table import CategoricalKNN

CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation']
)
```
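For context, here is a minimal, self-contained sketch. The toy real_table and synthetic_table below are invented purely for illustration; in practice they would be your real dataset and the synthetic data produced by your generator.
```python
import pandas as pd
from sdmetrics.single_table import CategoricalKNN

# Toy data, made up for illustration only.
real_table = pd.DataFrame({
    'age_bracket': ['18-25', '26-35', '36-45', '26-35'],
    'gender': ['F', 'M', 'F', 'M'],
    'political_affiliation': ['A', 'B', 'A', 'B'],
})
synthetic_table = pd.DataFrame({
    'age_bracket': ['18-25', '36-45', '26-35', '18-25'],
    'gender': ['M', 'F', 'M', 'F'],
    'political_affiliation': ['B', 'A', 'B', 'A'],
})

# The attacker knows age_bracket and gender, and wants political_affiliation.
score = CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation'],
)
print(score)  # closer to 1.0 means the attack is less successful
```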
Parameters
- (required) real_data: A pandas.DataFrame containing the real data
- (required) synthetic_data: A pandas.DataFrame containing the same columns of synthetic data
- (required) key_fields: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.
- (required) sensitive_fields: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.
- **kwargs: Optional keyword args that allow you to customize the model. These args are directly passed into the scikit-learn algorithm.
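As a hedged sketch of the **kwargs behavior described above: it assumes CategoricalKNN's underlying model is scikit-learn's KNeighborsClassifier, so n_neighbors is used as an illustrative parameter. The valid parameter names depend on the underlying model and your version of the library.
```python
from sdmetrics.single_table import CategoricalKNN

# Per the **kwargs description above, extra keyword args are forwarded to the
# underlying scikit-learn model (assumed here: KNeighborsClassifier).
score = CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation'],
    n_neighbors=5,  # illustrative; accepted names depend on the underlying model
)
```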
Metrics
Numerical Data
Use these metrics if the key and sensitive fields are numerical, representing continuous data (see the sketch after these lists).
- NumericalMLP
- NumericalLR
- NumericalSVR
- NumericalRadiusNearestNeighbor
Categorical Data
Use these metrics if the key and sensitive fields are categorical, representing discrete data.
- CategoricalKNN
- CategoricalNB
- CategoricalRF
- CategoricalEnsemble
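As a companion to the categorical usage example above, here is a sketch for the numerical metrics. The column names ('age', 'years_at_job', 'salary') are hypothetical; any numerical key and sensitive columns in your tables would work the same way.
```python
from sdmetrics.single_table import NumericalLR

# Numerical variant: key_fields and sensitive_fields must all be numerical columns.
score = NumericalLR.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age', 'years_at_job'],  # hypothetical numerical columns
    sensitive_fields=['salary'],         # hypothetical numerical column
)
```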
These metrics are in Beta. Be careful when using them and interpreting their scores.
- The score depends heavily on the underlying algorithm used to model the data. If the dataset is not well suited to a particular machine learning method, the predicted values may not be valid.
- In a real-world scenario, an attacker may spend more effort building an ML model. These metrics only allow you to select from specific algorithms (LR, MLP, etc.).
The CategoricalCAP metric also measures privacy using a similar methodology. In this metric, the attacker uses an inference algorithm called Correct Attribution Prediction (CAP). We recommend using this metric, as the CAP algorithm has been well studied for the purposes of evaluating synthetic data. This algorithm is also closely related to the privacy concepts of k-anonymity and l-diversity.
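CategoricalCAP lives in the same single_table module. Assuming it follows the same compute interface as the metrics above, a sketch reusing the earlier columns looks like this:
```python
from sdmetrics.single_table import CategoricalCAP

# Same attacker setup as before: known key columns, guessed sensitive column.
score = CategoricalCAP.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation'],
)
```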