CategoricalCAP
Looking to measure data privacy? We recommend using the DisclosureProtection metric instead of CategoricalCAP. The DisclosureProtection metric is based on CAP but provides additional features for ease of use:
- Support for continuous columns (numerical and datetime) and for missing values
- Better interpretation of the final score by comparing it against a baseline of random data
- Faster computation for larger datasets (see DisclosureProtectionEstimate)
The CategoricalCAP metric measures the risk of disclosing sensitive information through an inference attack. We assume that some values in the real data are public knowledge. An attacker combines this knowledge with the synthetic data to make guesses about other real values that are sensitive.
This metric describes how difficult it is for an attacker to correctly guess the sensitive information using an algorithm called Correct Attribution Probability (CAP).
Data Compatibility
- Categorical: This metric is meant for discrete, categorical data
- Boolean: This metric works well on boolean data
Continuous data (numerical and datetime columns) and missing values are not supported by this metric. For support with these data types, please consider using the DisclosureProtection metric instead.
Score
(best) 1.0: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the CAP algorithm.
(worst) 0.0: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the CAP algorithm.
Scores between 0.0 and 1.0 indicate the overall safety of the real data. For example, a score of 0.55 indicates that the data is roughly 55% safe from the attack.
This score may be hard to interpret. If your sensitive column contains many possible values, the data starts out with relatively higher safety.
For an easier-to-interpret score with a baseline value, please consider using the DisclosureProtection metric instead.
How does it work?
We assume that the attacker is in possession of:
- a few columns of the real data (`key_fields`), as well as
- the full synthetic dataset, including synthetic sensitive values

The attacker's goal is to correctly guess the real values of the sensitive columns (`sensitive_fields`). An example is shown below.
In this metric, the attacker uses an algorithm called CAP [1], summarized below.
The CAP Algorithm
The attacker follows 4 steps to guess a sensitive value.

1. Pick a row (r) in the real dataset. Note down all the `key_fields` in r.
2. In the synthetic data, find all the rows that match the `key_fields` of r. Call this set of rows S. The set S is also known as the (synthetic) equivalence class of r.
3. Each row in S will have synthetic values for the `sensitive_fields`. Let each of these values vote to guess the `sensitive_fields` of the real row, r.
4. The final score is the frequency of votes that are actually correct for all sensitive fields. This value must be between 0 and 1.
An illustration of this algorithm is shown below.
We repeat the attack for all rows (r) in the real data and calculate an overall probability of guessing the sensitive columns correctly. The metric returns `1 - probability`, so that a higher score means higher privacy.
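The steps above can be sketched in plain Python. This is an illustrative, minimal version with a hypothetical function name and a row-of-dicts data layout, not the SDMetrics implementation:

```python
from collections import Counter

def cap_privacy_score(real_rows, synthetic_rows, key_fields, sensitive_fields):
    # For each real row r, let r's synthetic equivalence class vote on the
    # sensitive values, then average the fraction of correct votes.
    per_row = []
    for r in real_rows:
        key = tuple(r[k] for k in key_fields)
        # S: synthetic rows whose key_fields match r exactly
        S = [s for s in synthetic_rows
             if tuple(s[k] for k in key_fields) == key]
        if not S:
            continue  # skip rows with an empty equivalence class
        target = tuple(r[f] for f in sensitive_fields)
        votes = Counter(tuple(s[f] for f in sensitive_fields) for s in S)
        per_row.append(votes[target] / len(S))
    probability = sum(per_row) / len(per_row) if per_row else 0.0
    return 1.0 - probability  # higher score means higher privacy

real = [
    {'zip': '10001', 'diagnosis': 'flu'},
    {'zip': '20002', 'diagnosis': 'cold'},
]
synthetic = [
    {'zip': '10001', 'diagnosis': 'flu'},
    {'zip': '10001', 'diagnosis': 'cold'},
    {'zip': '20002', 'diagnosis': 'cold'},
]
print(cap_privacy_score(real, synthetic, ['zip'], ['diagnosis']))  # 0.25
```

In the example, the first real row's equivalence class splits its votes (0.5 correct) and the second is guessed perfectly (1.0), so the attack succeeds with probability 0.75 and the privacy score is 0.25.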
Variants of CAP
The SDMetrics library supports variants of the CAP algorithm, based on what the attacker can do with the equivalence class.
- CategoricalZeroCAP: If there are no rows in the equivalence class, then the attacker records a failure (0) to guess that row.
- CategoricalGeneralizedCAP: If there are no rows in the equivalence class, the attacker looks for the closest matches instead of exact matches [2]. Because the `key_fields` are all discrete, the attacker finds the closest matches by using the Hamming distance [3]. This ensures that there is always at least 1 row in the equivalence class, which means that there is always at least 1 guess.
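The Hamming-distance fallback used by CategoricalGeneralizedCAP can be sketched as follows. The helper names are hypothetical and keys are assumed to be tuples of discrete values:

```python
def hamming(a, b):
    # Number of positions where two equal-length keys differ
    return sum(x != y for x, y in zip(a, b))

def generalized_equivalence_class(real_key, synthetic_keys):
    # Fall back to the synthetic keys at the minimum Hamming distance
    # from the real key, so the equivalence class is never empty.
    best = min(hamming(real_key, k) for k in synthetic_keys)
    return [k for k in synthetic_keys if hamming(real_key, k) == best]

keys = [('30s', 'F'), ('40s', 'M'), ('30s', 'M')]
# ('40s', 'F') has no exact match; two keys tie at distance 1
print(generalized_equivalence_class(('40s', 'F'), keys))
```

Note that when an exact match exists, its distance is 0, so the generalized class reduces to the ordinary equivalence class.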
Usage
Access this metric from the `single_table` module and use the `compute` method.
Parameters
- (required) `real_data`: A pandas.DataFrame containing the real data
- (required) `synthetic_data`: A pandas.DataFrame containing the same columns of synthetic data
- (required) `key_fields`: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.
- (required) `sensitive_fields`: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.
You can also import `CategoricalZeroCAP` and `CategoricalGeneralizedCAP` and use them in the same way.
FAQs
References
[1] Final Report on the Disclosure Risk Associated with the Synthetic Data Produced by the SYLLS Team
[2] A Baseline for Attribute Disclosure Risk in Synthetic Data
[3] https://en.wikipedia.org/wiki/Hamming_distance