# Privacy Against Inference

Privacy Against Inference describes a set of metrics that calculate the risk of an attacker being able to infer real, sensitive values. We assume that an attacker already possesses a few columns of real data; they will combine them with the synthetic data to make educated guesses.

The attacker can use various algorithms to make the guesses. Each is covered by a different metric:

- Guessing numerical values: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
- Guessing categorical values: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble

## Data Compatibility

**Categorical/Boolean**: Some metrics can be used for discrete, categorical data: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble

**Numerical**: Some metrics can be used for numerical data: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor

Choose a metric depending on the type of data that the attacker is guessing. The `key_fields` and `sensitive_fields` must all be of the same type. Note that missing values are not supported. Please remove or impute missing values before applying this metric.
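As a minimal sketch of the preparation step described above, the snippet below shows two common ways to handle missing values with pandas before computing a metric: dropping incomplete rows, or imputing with the column mean. The table, the column names (`age`, `income`) and the choice of mean imputation are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical real table with a few missing values.
real_data = pd.DataFrame({
    'age': [34.0, None, 51.0, 29.0],
    'income': [52000.0, 61000.0, None, 48000.0],
})

# Columns that would be passed as key_fields and sensitive_fields.
fields = ['age', 'income']

# Option 1: drop every row that has a missing value in one of the fields.
dropped = real_data.dropna(subset=fields)

# Option 2: impute missing values with the mean of each column.
imputed = real_data.copy()
imputed[fields] = imputed[fields].fillna(imputed[fields].mean())
```

Whichever option you choose, apply the same treatment to the synthetic data so that both tables are free of missing values.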

## Score

**(best) 1.0**: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the chosen attack algorithm.

**(worst) 0.0**: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the chosen attack algorithm.

## How does it work?

We assume that the attacker is in possession of:

- a few columns of the real data (`key_fields`), as well as
- the full synthetic dataset, including synthetic sensitive values

The attacker's goal is to correctly guess the real value of the sensitive information, `sensitive_fields`. An example is shown below.

To make the guesses, the attacker uses a machine learning algorithm based on the type of data that they want to guess.
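The attack can be sketched in plain Python to make the idea concrete. This is a simplified illustration, not the library's implementation: the real metrics use scikit-learn models, and their exact scoring may differ. Here the hypothetical function `inference_attack_score` stands in for the attacker's model. It learns the most common sensitive value for each key combination in the synthetic data, guesses the real values from the real keys, and returns 1 minus the fraction of correct guesses.

```python
from collections import Counter, defaultdict


def inference_attack_score(real_rows, synthetic_rows, key_fields, sensitive_field):
    """Return 1 - (fraction of real rows whose sensitive value is guessed correctly)."""
    # "Train" on the synthetic data: count the sensitive values seen per key combination.
    votes = defaultdict(Counter)
    for row in synthetic_rows:
        key = tuple(row[field] for field in key_fields)
        votes[key][row[sensitive_field]] += 1

    # Attack the real data: guess the most common synthetic value for each real key.
    correct = 0
    for row in real_rows:
        key = tuple(row[field] for field in key_fields)
        if key in votes:
            guess = votes[key].most_common(1)[0][0]
            if guess == row[sensitive_field]:
                correct += 1
        # A key never seen in the synthetic data counts as a failed guess.

    return 1.0 - correct / len(real_rows)


# Made-up example data: the attacker knows age_bracket and gender.
synthetic_rows = [
    {'age_bracket': '30-39', 'gender': 'F', 'party': 'A'},
    {'age_bracket': '30-39', 'gender': 'F', 'party': 'A'},
    {'age_bracket': '30-39', 'gender': 'F', 'party': 'B'},
    {'age_bracket': '40-49', 'gender': 'M', 'party': 'B'},
    {'age_bracket': '20-29', 'gender': 'M', 'party': 'A'},
]
real_rows = [
    {'age_bracket': '30-39', 'gender': 'F', 'party': 'A'},
    {'age_bracket': '40-49', 'gender': 'M', 'party': 'A'},
    {'age_bracket': '20-29', 'gender': 'M', 'party': 'A'},
    {'age_bracket': '50-59', 'gender': 'F', 'party': 'B'},
]

score = inference_attack_score(
    real_rows, synthetic_rows, ['age_bracket', 'gender'], 'party'
)
print(score)  # 0.5: two of the four real values were guessed correctly
```

In this toy case the attacker recovers half of the sensitive values, so the score sits midway between the best (1.0) and worst (0.0) outcomes.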

## Usage

Access this metric from the `single_table` module and use the `compute` method.

**Parameters**

- (required) `real_data`: A pandas.DataFrame containing the real data
- (required) `synthetic_data`: A pandas.DataFrame containing the same columns of synthetic data
- (required) `key_fields`: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.
- (required) `sensitive_fields`: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.
- `metadata`: A description of the dataset. See Single Table Metadata
- `**kwargs`: Optional keyword args that allow you to customize the model. These args are passed directly into the scikit-learn algorithm.

**Metrics**

Use these metrics if the key and sensitive fields are numerical, representing continuous data.

- NumericalMLP
- NumericalLR
- NumericalSVR
- NumericalRadiusNearestNeighbor

## FAQs

**This metric is in Beta.** Be careful when using the metric and interpreting its score.

The score depends heavily on the underlying algorithm used to model the data. If the dataset is not suited to a particular machine learning method, then the predicted values may not be valid.

In a real-world scenario, an attacker may spend more effort on building an ML model. These metrics only allow you to select from specific algorithms (LR, MLP, etc.).
