# Privacy Against Inference

Privacy Against Inference describes a set of metrics that calculate the risk of an attacker being able to infer real, sensitive values. We assume that an attacker already possesses a few columns of real data, which they combine with the synthetic data to make educated guesses.
The attacker can use various algorithms to make the guesses. Each is covered by a different metric:
• Guessing numerical values: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
• Guessing categorical values: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble

## Data Compatibility

• Categorical/Boolean: Some metrics can be used for discrete, categorical data: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble
• Numerical: Some metrics can be used for numerical data: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
Choose a metric depending on the type of data that the attacker is guessing. The `key_fields` and `sensitive_fields` must all be of the same type. Note that missing values are not supported: remove or impute them before applying any of these metrics.
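The snippet below is a minimal sketch of the two ways to handle missing values mentioned above, using plain pandas. The table contents and the `'missing'` placeholder are made up for illustration; whichever option you choose, the real and synthetic tables would presumably need the same treatment.

```python
import pandas as pd

# Hypothetical table with gaps in the key fields
real_table = pd.DataFrame({
    'age_bracket': ['18-29', '30-44', None, '45-64'],
    'gender': ['F', 'M', 'F', None],
    'political_affiliation': ['A', 'B', 'A', 'B'],
})
fields = ['age_bracket', 'gender', 'political_affiliation']

# Option 1: remove every row that has a missing value in a relevant field
dropped = real_table.dropna(subset=fields)

# Option 2: impute, here by treating "missing" as its own category
imputed = real_table.copy()
imputed[fields] = imputed[fields].fillna('missing')
```

Dropping rows shrinks the table, while imputing keeps every row but adds an artificial category, so the better choice depends on how much data is affected.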

## Score

(best) 1.0: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the chosen attack algorithm.
(worst) 0.0: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the chosen attack algorithm.
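To make the two endpoints concrete, here is a simplified illustration that scores an attack as one minus the fraction of exact correct guesses. The function `illustrative_privacy_score` is hypothetical and is not the library's formula; the actual computation may differ, particularly for numerical values, where closeness may matter more than exact matches.

```python
def illustrative_privacy_score(guesses, true_values):
    """Return 1.0 when no guess is correct and 0.0 when every guess is correct.

    Assumption for illustration only: the score falls linearly with the
    fraction of exact matches between guesses and true values.
    """
    if len(guesses) != len(true_values):
        raise ValueError('guesses and true_values must have the same length')
    if not true_values:
        raise ValueError('at least one value is required')
    correct = sum(g == t for g, t in zip(guesses, true_values))
    return 1.0 - correct / len(true_values)


all_wrong = illustrative_privacy_score(['A', 'A'], ['B', 'B'])   # best case
all_right = illustrative_privacy_score(['A', 'B'], ['A', 'B'])   # worst case
half_right = illustrative_privacy_score(['A', 'A'], ['A', 'B'])  # in between
```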

## How does it work?

We assume that the attacker is in possession of:
• a few columns of the real data (`key_fields`), and
• the full synthetic dataset, including the synthetic sensitive values.
The attacker's goal is to correctly guess the real values of the sensitive information (`sensitive_fields`). For example, assume the `key_fields` are a person's age bracket and gender, and the `sensitive_fields` contain the person's political affiliation, which is what the attacker wants to guess.
To make the guesses, the attacker uses a machine learning algorithm based on the type of data that they want to guess.
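The attack described above can be sketched in plain Python. This is a deliberately simplified stand-in: instead of a machine learning model, the hypothetical attacker below guesses the most common synthetic sensitive value for each combination of key values, and it handles a single sensitive field. The helper names and the sample rows are invented for illustration.

```python
from collections import Counter, defaultdict


def train_attacker(synthetic_rows, key_fields, sensitive_field):
    """Learn, from synthetic data only, the most common sensitive value per key combination."""
    votes = defaultdict(Counter)
    overall = Counter()
    for row in synthetic_rows:
        key = tuple(row[f] for f in key_fields)
        votes[key][row[sensitive_field]] += 1
        overall[row[sensitive_field]] += 1
    lookup = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    fallback = overall.most_common(1)[0][0]  # used for key combinations never seen
    return lookup, fallback


def guess(lookup, fallback, real_rows, key_fields):
    """Guess the sensitive value of each real row from its known key fields."""
    return [lookup.get(tuple(row[f] for f in key_fields), fallback) for row in real_rows]


synthetic_rows = [
    {'age_bracket': '18-29', 'gender': 'F', 'political_affiliation': 'Party A'},
    {'age_bracket': '18-29', 'gender': 'F', 'political_affiliation': 'Party A'},
    {'age_bracket': '18-29', 'gender': 'F', 'political_affiliation': 'Party B'},
    {'age_bracket': '30-44', 'gender': 'M', 'political_affiliation': 'Party B'},
    {'age_bracket': '30-44', 'gender': 'M', 'political_affiliation': 'Party B'},
]
real_rows = [
    {'age_bracket': '18-29', 'gender': 'F', 'political_affiliation': 'Party A'},
    {'age_bracket': '30-44', 'gender': 'M', 'political_affiliation': 'Party A'},
    {'age_bracket': '45-64', 'gender': 'F', 'political_affiliation': 'Party B'},
]
key_fields = ['age_bracket', 'gender']

lookup, fallback = train_attacker(synthetic_rows, key_fields, 'political_affiliation')
guesses = guess(lookup, fallback, real_rows, key_fields)

# Compare the guesses with the real sensitive values the attacker never saw
correct = sum(g == row['political_affiliation'] for g, row in zip(guesses, real_rows))
```

The important property is that the attacker trains only on synthetic rows and is evaluated against real sensitive values, so a high number of correct guesses suggests the synthetic data leaks the relationship between key and sensitive fields.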

## Usage

Access this metric from the `single_table` module and use the `compute` method.

```python
from sdmetrics.single_table import CategoricalKNN

CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation']
)
```

### Parameters

• (required) `real_data`: A pandas.DataFrame containing the real data
• (required) `synthetic_data`: A pandas.DataFrame containing the same columns of synthetic data
• (required) `key_fields`: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.
• (required) `sensitive_fields`: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.
• `metadata`: A description of the dataset. See Single Table Metadata
• `**kwargs`: Optional keyword args that allow you to customize the model. These args are directly passed into the scikit-learn algorithm

### Metrics

#### Numerical Data

Use these metrics if the key and sensitive fields are numerical, representing continuous data.
• NumericalMLP
• NumericalLR
• NumericalSVR