DisclosureProtection
The DisclosureProtection metric measures the risk associated with disclosing (aka broadly sharing) the synthetic data. It's a useful measurement if you want to know whether synthetic data is leaking patterns that pertain to sensitive information.
The attack scenario: If an attacker has prior knowledge about certain attributes (e.g. a person's age and gender) and they are given access to the full synthetic data, would they be able to make better guesses about what they don't know (e.g. that person's political affiliation)?
This metric simulates the attack scenario using your real and synthetic data. It describes how much your synthetic data protects against the risk of disclosure as compared to a baseline of completely random data.
Categorical: This metric is meant for discrete, categorical data
Boolean: This metric works on booleans because it is a type of categorical data
Numerical: This metric works on numerical data by discretizing it into categories
Datetime: This metric works on datetime data by discretizing it into categories
Missing values are supported. This metric considers missing values as a single, separate category value.
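For intuition only (this is not the library's internal logic), a numerical column could be discretized with pandas before the comparison, with missing values kept as their own category:

```python
import pandas as pd

# Hypothetical numerical column; the bin count mirrors the num_discrete_bins parameter described later.
ages = pd.Series([23, 31, 47, None, 52, 29], name='age')

# Discretize into 10 equal-width bins, then treat missing values as a separate category.
age_categories = pd.cut(ages, bins=10).cat.add_categories('missing').fillna('missing')
print(age_categories.value_counts())
```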
(best) 1.0: The synthetic data provides strong disclosure protection. Sharing the synthetic data provides no more risk than sharing completely random values.
(worst) 0.0: The synthetic data does not provide disclosure protection. Sharing the synthetic data divulges patterns that make it easy to guess sensitive attributes.
Scores between 0.0 and 1.0 indicate the relative level of protection. For example, a score of 0.825 indicates that the synthetic data provides 82.5% of the protection that completely random data would provide.
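As a worked example of this reading (the numbers are hypothetical, and treating the score as a simple ratio of the two protection values is an assumption):

```python
protection_with_synthetic = 0.66  # hypothetical protection measured from the synthetic data
protection_with_random = 0.80     # hypothetical protection offered by completely random data

score = min(protection_with_synthetic / protection_with_random, 1.0)
print(round(score, 3))  # 0.825
```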
To simulate the attack scenario, we use the real data. We pretend the attacker knows a few columns of the real data (known columns) and wants to guess other columns (sensitive columns). The attacker also has a full synthetic dataset.
To compute this metric, we assume the attacker uses an algorithm called CAP [1] to make guesses based on the synthetic data. We baseline this with the protection that completely random data would provide in place of the synthetic.
The baseline is the protection offered by disclosing completely random data. To compute this, we count the total number of value combinations that are possible to guess across all sensitive columns, then invert it so that a higher value means more protection.
The baseline protection is higher if there are more possible values to guess from, because there is a lower probability of randomly getting it right.
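One plausible reading of this, sketched below, is that the baseline equals one minus the chance of a random correct guess across all sensitive-value combinations; the exact formula the metric uses may differ.

```python
import numpy as np
import pandas as pd

def baseline_protection(real_data: pd.DataFrame, sensitive_column_names: list) -> float:
    """Protection offered by guessing completely at random (a sketch, not the library's code)."""
    # Count every combination of sensitive values an attacker could guess from.
    num_combinations = np.prod(
        [real_data[col].nunique(dropna=False) for col in sensitive_column_names]
    )
    # Invert: more combinations -> lower chance of a random correct guess -> more protection.
    return 1 - 1 / num_combinations
```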
The CAP score is the protection offered by disclosing synthetic data. To compute this, we simulate the attacker following the 4-step algorithm defined below:
Pick a row (r) in the real dataset. Note down all the known columns in r.
In the synthetic data, find all the rows that match the known columns of r. Call this set of rows S, also known as the (synthetic) equivalence class of r.
Each row in S will have synthetic values for the sensitive columns. Let each of the values vote to guess the sensitive columns of the real row, r.
The safety score for the row is the fraction of votes that fail to correctly guess all of the sensitive columns. This value is always between 0 and 1.
An illustration of this algorithm is shown below.
We repeat the attack for all rows (r) in the real data. We average the safety scores to form the overall CAP protection score.
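A minimal sketch of these four steps in pandas, for intuition only (it is not the library's implementation, and it skips rows with an empty equivalence class, the case discussed next):

```python
import pandas as pd

def cap_protection(real_data, synthetic_data, known_cols, sensitive_cols):
    """Average per-row safety score under the CAP attack (exact-match sketch)."""
    scores = []
    for _, r in real_data.iterrows():
        # Steps 1-2: find the synthetic equivalence class of r on the known columns.
        mask = (synthetic_data[known_cols] == r[known_cols]).all(axis=1)
        equivalence_class = synthetic_data[mask]
        if equivalence_class.empty:
            continue  # plain CAP skips rows with no matches (see the variants below)

        # Step 3: each matching synthetic row votes on the sensitive columns.
        correct_votes = (equivalence_class[sensitive_cols] == r[sensitive_cols]).all(axis=1)

        # Step 4: safety score = fraction of votes that do NOT guess every sensitive value.
        scores.append(1 - correct_votes.mean())

    # Averaging only over rows that had matches; returning 1.0 otherwise is a simplification.
    return sum(scores) / len(scores) if scores else 1.0
```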
What if there are no rows in the equivalence class? In the regular CAP algorithm, the attacker simply ignores that row and moves on to the next one. Variants of CAP are available based on alternate logic:
Zero CAP: If there are no rows in the equivalence class, then the attacker records a failure (a score of 0) to guess that row.
Generalized CAP: If there are no rows in the equivalence class, the attacker looks for the closest matches instead of exact matches [2] by using the Hamming distance [3]. This ensures that there is always at least 1 row in the equivalence class, which means that there is always at least 1 guess.
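As an illustration of the Generalized CAP fallback (not the library's code), the Hamming distance simply counts how many known-column values differ, and the closest synthetic rows form the equivalence class:

```python
import pandas as pd

def closest_matches(synthetic_data: pd.DataFrame, r: pd.Series, known_cols: list) -> pd.DataFrame:
    """Synthetic rows whose known columns are closest to the real row r by Hamming distance."""
    # Hamming distance: the number of known columns whose value differs from r's value.
    # (Assumes missing values were already encoded as their own category.)
    distances = (synthetic_data[known_cols] != r[known_cols]).sum(axis=1)
    # Keep every row tied for the smallest distance, so the equivalence class is never empty.
    return synthetic_data[distances == distances.min()]
```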
Access this metric from the single_table module and use the compute method.
Parameters
(required) real_data: A pandas.DataFrame containing the real data
(required) synthetic_data: A pandas.DataFrame containing the same columns of synthetic data
(required) known_column_names: A list of strings representing the column names that the attacker already knows
(required) sensitive_column_names: A list of strings representing the column names that the attacker wants to guess
continuous_column_names: A list of column names that represent continuous values. Identify any of the column names (known or sensitive) that need discretization.
(default) None: Assume none of the columns need discretization
num_discrete_bins: For any continuous columns that need discretization, this parameter represents the number of bins to create
(default) 10: Discretize continuous columns into 10 bins
computation: The type of computation we'll use to simulate the attack. Options are:
(default) 'cap': Use the CAP method described in the original paper
'generalized_cap': Use the Generalized CAP method
'zero_cap': Use the Zero CAP method
Alternatively, you can use the compute_breakdown method with the same parameters. This returns the individual scores for CAP and baseline.
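A sketch of that call follows; the key names shown in the example output are assumptions, not the library's documented return format:

```python
breakdown = DisclosureProtection.compute_breakdown(
    real_data=real_data,
    synthetic_data=synthetic_data,
    known_column_names=['age', 'gender'],
    sensitive_column_names=['political_affiliation'],
)
print(breakdown)
# Hypothetical output: {'score': 0.825, 'cap_protection': 0.66, 'baseline_protection': 0.8}
```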
[1] Final Report on the Disclosure Risk Associated with the Synthetic Data Produced by the SYLLS Team
[2] A Baseline for Attribute Disclosure Risk in Synthetic Data
[3] Hamming distance, https://en.wikipedia.org/wiki/Hamming_distance
Illustration of the algorithm: since the real row (r) has an age of 20-29 and a gender of F, the synthetic equivalence class (S) for it has the same. In this case, S has 4 rows. Each has a synthetic value for the sensitive political affiliation.