
# CategoricalCAP

The CategoricalCAP measures the risk of disclosing sensitive information through an inference attack. We assume that some values in the real data are public knowledge. An attacker combines this public knowledge with the synthetic data to guess other real values that are sensitive.
This metric describes how difficult it is for an attacker to correctly guess the sensitive information using an algorithm called Correct Attribution Probability (CAP).

## Data Compatibility

• Categorical: This metric is meant for discrete, categorical data
• Boolean: This metric works well on boolean data
Missing values are not yet supported for this metric. If your columns contain missing values, consider removing them or creating a different category to denote what they are.
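Both suggestions can be sketched with pandas, as below. The table, the column names and the `'MISSING'` label are hypothetical and chosen only for illustration. Whichever option you pick, it is assumed here that you would apply the same treatment to the real and the synthetic table.

```python
import pandas as pd

# Hypothetical table with missing entries in the columns used by the metric.
real_table = pd.DataFrame({
    'age_bracket': ['20-29', None, '30-39'],
    'gender': ['F', 'M', None],
    'political_affiliation': ['A', 'B', None],
})
columns_used = ['age_bracket', 'gender', 'political_affiliation']

# Option 1: drop every row that has a missing value in any of these columns.
dropped = real_table.dropna(subset=columns_used)

# Option 2: keep all rows and give missing entries their own category.
filled = real_table.copy()
filled[columns_used] = filled[columns_used].fillna('MISSING')
```

Option 2 keeps every row, but the attacker's guesses can then include the placeholder category, so interpret the score with that in mind.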

## Score

(best) 1.0: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the CAP algorithm.
(worst) 0.0: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the CAP algorithm.
Scores between 0.0 and 1.0 indicate the overall safety of the real data. For example, a score of 0.55 indicates that the data is generally 55% safe from attack.
This image shows some sensitive real data in comparison to an attacker's guesses. The probability of the attacker correctly guessing the real value is highlighted. We can use this to determine the overall safety of the data. In this case, the CategoricalCAP score is 0.55, meaning the data is on average 55% safe from attack.
Tip! When interpreting this score, it's helpful to think through the chances of an attacker gaining possession of the real data and the risks of correctly guessing sensitive values.

## How does it work?

We assume that the attacker is in possession of:
• a few columns of the real data (`key_fields`), as well as
• the full synthetic dataset, including synthetic sensitive values
The attacker's goal is to correctly guess the real value of the sensitive information, `sensitive_fields`. An example is shown below.
In this example, we assume the `key_fields` are a person's age bracket and gender. Meanwhile, the `sensitive_fields` are the person's political affiliation; this is what the attacker wants to guess.
In this metric, the attacker uses an algorithm called CAP [1], summarized below.

### The CAP Algorithm

The attacker follows 4 steps to guess a sensitive value.
1. Pick a row (r) in the real dataset. Note down all the `key_fields` in r.
2. In the synthetic data, find all the rows that match the `key_fields` of r. Call this set of rows S. The set S is also known as the (synthetic) equivalence class of r.
3. Each row in S will have synthetic values for the `sensitive_fields`. Let each of the values vote to guess the `sensitive_fields` of the real row, r.
4. The final score is the frequency of votes that are actually correct for all sensitive fields. This value must be between 0 and 1.
An illustration of this algorithm is shown below.
If a real row has an age bracket of `20-29` and a gender of `F`, its synthetic equivalence class (S) contains every synthetic row with the same age bracket and gender. In this case, S has 4 rows. Each has a synthetic value for the sensitive `political affiliation`.
We repeat the attack for all rows (r) in the real data and calculate an overall probability of guessing the sensitive column correctly. The metric returns `1-probability` so that a higher score means higher privacy.
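The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the SDMetrics implementation. It assumes that the overall probability is the mean of the per-row probabilities and that real rows with an empty equivalence class are skipped. The function name `cap_privacy_score` and the toy rows are hypothetical.

```python
from collections import Counter


def cap_privacy_score(real_rows, synthetic_rows, key_fields, sensitive_fields):
    """Illustrative CAP attack over lists of dicts; returns 1 - probability."""
    def key_of(row):
        return tuple(row[field] for field in key_fields)

    def sensitive_of(row):
        return tuple(row[field] for field in sensitive_fields)

    # Step 2, precomputed: one Counter of sensitive-value votes per
    # synthetic equivalence class.
    votes_by_key = {}
    for row in synthetic_rows:
        votes_by_key.setdefault(key_of(row), Counter())[sensitive_of(row)] += 1

    probabilities = []
    for row in real_rows:                      # Step 1: pick a real row r
        votes = votes_by_key.get(key_of(row))  # Step 2: look up S
        if votes is None:
            continue  # Assumption: skip rows whose equivalence class is empty
        correct = votes[sensitive_of(row)]     # Step 3: count the correct votes
        probabilities.append(correct / sum(votes.values()))  # Step 4

    if not probabilities:
        return None  # no real row could be attacked
    return 1 - sum(probabilities) / len(probabilities)


# Hypothetical toy data: S for ('20-29', 'F') has 4 rows, 2 of which vote 'A'.
real_rows = [
    {'age_bracket': '20-29', 'gender': 'F', 'political_affiliation': 'A'},
    {'age_bracket': '40-49', 'gender': 'M', 'political_affiliation': 'B'},
]
synthetic_rows = [
    {'age_bracket': '20-29', 'gender': 'F', 'political_affiliation': 'A'},
    {'age_bracket': '20-29', 'gender': 'F', 'political_affiliation': 'A'},
    {'age_bracket': '20-29', 'gender': 'F', 'political_affiliation': 'B'},
    {'age_bracket': '20-29', 'gender': 'F', 'political_affiliation': 'C'},
    {'age_bracket': '30-39', 'gender': 'M', 'political_affiliation': 'A'},
]
score = cap_privacy_score(
    real_rows, synthetic_rows,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation'],
)
```

In this toy case the second real row has no matching synthetic rows and is skipped, so the score depends only on the first row: half of the votes are correct, giving `1 - 0.5`.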

### Variants of CAP

The SDMetrics library supports variants of the CAP algorithm, based on what the attacker can do with the equivalence class.
• CategoricalZeroCAP: If there are no rows in the equivalence class, then the attacker records a failure (`0`) to guess that row
• CategoricalGeneralizedCAP: If there are no rows in the equivalence class, the attacker looks for the closest matches instead of exact matches [2]. Because the `key_fields` are all discrete, the attacker finds the closest matches by using the Hamming distance [3]. This ensures that there is always at least 1 row in the equivalence class, which means that there is always at least 1 guess.
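To make the closest-match idea concrete, here is a small sketch of a Hamming-distance lookup over key tuples. It is an illustration only, not the library's code. The helper names `hamming_distance` and `closest_keys` and the example keys are hypothetical, and the sketch assumes the synthetic data has at least one row.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length key tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_keys(real_key, synthetic_keys):
    """Return the synthetic keys at the minimum Hamming distance from real_key."""
    distances = [hamming_distance(real_key, key) for key in synthetic_keys]
    best = min(distances)
    return [key for key, dist in zip(synthetic_keys, distances) if dist == best]


# Hypothetical keys: (age_bracket, gender). No synthetic key matches exactly.
real_key = ('20-29', 'F')
synthetic_keys = [('20-29', 'M'), ('30-39', 'M'), ('30-39', 'F')]
nearest = closest_keys(real_key, synthetic_keys)
```

Here two synthetic keys differ from the real key in exactly one field, so both would contribute rows to the generalized equivalence class. When an exact match exists, its distance is 0 and the lookup reduces to the exact-match case.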

## Usage

Access this metric from the `single_table` module and use the `compute` method.
```python
from sdmetrics.single_table import CategoricalCAP

score = CategoricalCAP.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation']
)
```
Parameters
• (required) `real_data`: A pandas.DataFrame containing the real data
• (required) `synthetic_data`: A pandas.DataFrame containing the same columns of synthetic data
• (required) `key_fields`: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.
• (required) `sensitive_fields`: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.
You can also import `CategoricalZeroCAP` and `CategoricalGeneralizedCAP` and use them in the same way.
```python
from sdmetrics.single_table import CategoricalZeroCAP, CategoricalGeneralizedCAP

score = CategoricalZeroCAP.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation']
)
```

## FAQs

How does CAP compare to k-anonymity?
The concept of k-anonymity captures the number of rows in a dataset that cannot be distinguished from each other [4]. A higher value of k means that it's harder for an attacker to distinguish an individual.
The CAP is related to k-anonymity. If there are at least k rows in every synthetic equivalence class, then the synthetic dataset has k-anonymity for the known values [2].
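As a rough illustration of this relationship, the size of every synthetic equivalence class can be inspected with a pandas `groupby`. The table below is hypothetical toy data, and this check is not part of the metric itself.

```python
import pandas as pd

# Hypothetical synthetic table.
synthetic_table = pd.DataFrame({
    'age_bracket': ['20-29', '20-29', '30-39', '30-39', '30-39'],
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'political_affiliation': ['A', 'B', 'A', 'A', 'C'],
})
key_fields = ['age_bracket', 'gender']

# Number of rows in each synthetic equivalence class.
class_sizes = synthetic_table.groupby(key_fields).size()

# Largest k such that every equivalence class has at least k rows.
k = int(class_sizes.min())
```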
How does CAP compare to ℓ-diversity?
The concept of ℓ-diversity measures the uniqueness of sensitive values in a dataset [5]. A higher value of ℓ means that it's harder for an attacker to guess a particular sensitive value.
The CAP is related to ℓ-diversity. If there are at least ℓ unique sensitive values in every synthetic equivalence class, then the synthetic dataset has ℓ-diversity [2].
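Similarly, counting the distinct sensitive values in each synthetic equivalence class gives a rough view of ℓ. The sketch below uses hypothetical toy data, and the variable name `ell` is only illustrative.

```python
import pandas as pd

# Hypothetical synthetic table.
synthetic_table = pd.DataFrame({
    'age_bracket': ['20-29', '20-29', '30-39', '30-39', '30-39'],
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'political_affiliation': ['A', 'B', 'A', 'A', 'C'],
})
key_fields = ['age_bracket', 'gender']

# Number of distinct sensitive values in each synthetic equivalence class.
distinct_values = (
    synthetic_table.groupby(key_fields)['political_affiliation'].nunique()
)

# Largest value such that every class has at least that many distinct values.
ell = int(distinct_values.min())
```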