# Privacy Against Inference

Privacy Against Inference describes a set of metrics that calculate the risk of an attacker inferring real, sensitive values. We assume that an attacker already possesses a few columns of real data; they will combine these with the synthetic data to make educated guesses.

The attacker can use various algorithms to make the guesses. Each is covered by a different metric:

* Guessing numerical values: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
* Guessing categorical values: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble

## Data Compatibility

* **Categorical/Boolean**: Some metrics can be used for discrete, categorical data: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble
* **Numerical**: Some metrics can be used for numerical data: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor

Choose a metric depending on the type of data that the attacker is guessing. The `key_fields` and `sensitive_fields` must all be of the same type. Note that missing values are not supported. Please remove or impute missing values before applying this metric.
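For example, missing values can be removed with pandas before computing a metric. Dropping incomplete rows is shown below; imputing values is equally valid. The column names here are purely illustrative:

```python
import pandas as pd

# Illustrative table with missing values in both a key and a sensitive column.
real_table = pd.DataFrame({
    'age_bracket': ['30-39', None, '40-49'],
    'political_affiliation': ['blue', 'red', None],
})

# Keep only rows that are complete in the columns the metric will use.
clean_table = real_table.dropna(subset=['age_bracket', 'political_affiliation'])
print(len(clean_table))  # only 1 of the 3 rows is fully populated
```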

## Score

**(best) 1.0**: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the chosen attack algorithm.

**(worst) 0.0**: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the chosen attack algorithm.
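As a concrete reading of the scale: if the attacker correctly guesses 25 out of 100 sensitive values, the score works out as follows (assuming the score is one minus the attacker's guessing accuracy, consistent with the two endpoints above):

```python
total_records = 100
correct_guesses = 25

# Score = 1 - (fraction of sensitive values the attacker guessed correctly).
score = 1 - correct_guesses / total_records
print(score)  # 0.75
```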

## How does it work?

We assume that the attacker is in possession of

* a few columns of the real data (`key_fields`), as well as
* the full synthetic dataset, including synthetic sensitive values

The attacker's goal is to correctly guess the real value of the sensitive information, `sensitive_fields`. An example is shown below.

<figure><img src="/files/CtOj7JubH0XftUFwcwnB" alt=""><figcaption><p>In this example, we assume the <code>key_fields</code> are a person's age bracket and gender. Meanwhile, the <code>sensitive_fields</code> are the person's political affiliation; this is what the attacker wants to guess.</p></figcaption></figure>

To make the guesses, the attacker uses a machine learning algorithm based on the type of data that they want to guess.
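The general recipe can be sketched as follows. This is a simplified, hypothetical attack loop that uses an exact-match lookup in place of a trained model; the actual metrics fit real scikit-learn classifiers, and the data below is invented for illustration:

```python
from collections import Counter

# Toy records: key fields are (age_bracket, gender); the sensitive field is
# political_affiliation.
synthetic = [
    (('30-39', 'F'), 'blue'),
    (('30-39', 'F'), 'blue'),
    (('40-49', 'M'), 'red'),
]
real = [
    (('30-39', 'F'), 'blue'),   # the attacker's lookup will guess correctly
    (('40-49', 'M'), 'green'),  # the attacker's lookup will guess wrong
]

# "Train" on the synthetic data: remember the most common synthetic sensitive
# value for each combination of key fields.
model = {}
for key, value in synthetic:
    model.setdefault(key, Counter())[value] += 1

# Attack the real data: guess each sensitive value from the known key fields.
hits = 0
for key, true_value in real:
    counts = model.get(key)
    guess = counts.most_common(1)[0][0] if counts else None
    hits += guess == true_value

# Score: 1.0 means no correct guesses (safe); 0.0 means all guessed (unsafe).
score = 1 - hits / len(real)
print(score)  # 0.5
```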

## Usage

Access this metric from the `single_table` module and use the `compute` method.

```python
from sdmetrics.single_table import CategoricalKNN

CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation']
)
```

**Parameters**

* (required) `real_data`: A pandas.DataFrame containing the real data
* (required) `synthetic_data`: A pandas.DataFrame containing the same columns of synthetic data
* (required) `key_fields`: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.
* (required) `sensitive_fields`: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.
* `metadata`: A description of the dataset. See [Single Table Metadata](/sdmetrics/getting-started/metadata/single-table-metadata.md)
* `**kwargs`: Optional keyword args that allow you to customize the model. These args are directly passed into the scikit-learn algorithm
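To illustrate how `**kwargs` reach the underlying model, here is a rough, hypothetical re-creation of what a categorical KNN attack might do with scikit-learn directly. The real `CategoricalKNN` implementation may differ, and the data and column names are only examples:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder

synthetic_table = pd.DataFrame({
    'age_bracket': ['30-39', '30-39', '40-49', '40-49'],
    'gender': ['F', 'F', 'M', 'M'],
    'political_affiliation': ['blue', 'blue', 'red', 'red'],
})
real_table = pd.DataFrame({
    'age_bracket': ['30-39', '40-49'],
    'gender': ['F', 'M'],
    'political_affiliation': ['blue', 'green'],
})
key_fields = ['age_bracket', 'gender']

# Keyword args such as n_neighbors would be forwarded to the scikit-learn
# model by the metric; here we pass them to the classifier directly.
model_kwargs = {'n_neighbors': 1}
attacker = KNeighborsClassifier(**model_kwargs)

# Fit the attacker's model on the synthetic data only.
encoder = OrdinalEncoder().fit(synthetic_table[key_fields])
attacker.fit(
    encoder.transform(synthetic_table[key_fields]),
    synthetic_table['political_affiliation'],
)

# Guess the real sensitive values from the known key fields, then score.
guesses = attacker.predict(encoder.transform(real_table[key_fields]))
accuracy = (guesses == real_table['political_affiliation']).mean()
score = 1 - accuracy
print(score)  # 0.5: one of the two guesses was correct
```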

**Metrics**

{% tabs %}
{% tab title="Numerical Data" %}
Use these metrics if the key and sensitive fields are numerical, representing continuous data.

* NumericalMLP
* NumericalLR
* NumericalSVR
* NumericalRadiusNearestNeighbor
  {% endtab %}

{% tab title="Categorical Data" %}
Use these metrics if the key and sensitive fields are categorical, representing discrete data.

* CategoricalKNN
* CategoricalNB
* CategoricalRF
* CategoricalEnsemble
  {% endtab %}
  {% endtabs %}

## FAQs

{% hint style="info" %}
**This metric is in Beta.** Be careful when using the metric and interpreting its score.

* The score depends heavily on the underlying algorithm used to model the data. If the dataset is not suited to a particular machine learning method, the predicted values may not be valid.
* In a real-world scenario, an attacker may invest more effort in building an ML model. These metrics only allow you to select from specific algorithms (LR, MLP, etc.)
  {% endhint %}

<details>

<summary>What other metrics can I use to measure privacy?</summary>

The [CategoricalCAP](/sdmetrics/data-metrics/privacy/categoricalcap.md) metric also measures privacy using a similar methodology. In this metric, the attacker uses an inference algorithm called Correct Attribution Prediction (CAP). We recommend using this metric, as the CAP algorithm has been well studied for the purposes of evaluating synthetic data. This algorithm is also closely related to the privacy concepts of k-anonymity and l-diversity.

</details>

