DisclosureProtection
The DisclosureProtection metric measures the risk associated with disclosing (aka broadly sharing) the synthetic data. It's a useful measurement if you want to know whether synthetic data is leaking patterns that pertain to sensitive information.
The attack scenario: If an attacker has prior knowledge about certain attributes (e.g. a person's age and gender) and they are given access to the full synthetic data, would they be able to make better guesses about what they don't know (e.g. that person's political affiliation)?
This metric simulates the attack scenario using your real and synthetic data. It describes how much your synthetic data protects against the risk of disclosure as compared to a baseline of completely random data.
Categorical: This metric is meant for discrete, categorical data
Boolean: This metric works on booleans because it is a type of categorical data
Numerical: This metric works on numerical data by discretizing it into categories
Datetime: This metric works on datetime data by discretizing it into categories
Missing values are supported. This metric considers missing values as a single, separate category value.
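For intuition only (this is not the library's internal logic), a numerical column could be discretized with pandas before the comparison, with missing values kept as their own category:

```python
import pandas as pd

# Hypothetical numerical column; the bin count mirrors the num_discrete_bins parameter described later.
ages = pd.Series([23, 31, 47, None, 52, 29], name='age')

# Discretize into 10 equal-width bins, then treat missing values as a separate category.
age_categories = pd.cut(ages, bins=10).cat.add_categories('missing').fillna('missing')
print(age_categories.value_counts())
```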
(best) 1.0: The synthetic data provides strong disclosure protection. Sharing the synthetic data provides no more risk than sharing completely random values.
(worst) 0.0: The synthetic data does not provide disclosure protection. Sharing the synthetic data divulges patterns that make it easy to guess sensitive attributes.
Scores between 0.0 and 1.0 indicate the relative level of protection. For example, a score of 0.825 indicates that the synthetic data provides 82.5% of the protection that completely random data would provide.
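As a worked example of this reading (the numbers are hypothetical, and treating the score as a simple ratio of the two protection values is an assumption):

```python
protection_with_synthetic = 0.66  # hypothetical protection measured from the synthetic data
protection_with_random = 0.80     # hypothetical protection offered by completely random data

score = min(protection_with_synthetic / protection_with_random, 1.0)
print(round(score, 3))  # 0.825
```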
To simulate the attack scenario, we use the real data. We pretend the attacker knows a few columns of the real data (known columns) and wants to guess other columns (sensitive columns). The attacker also has a full synthetic dataset.
To compute this metric, we assume the attacker uses an algorithm called CAP [1] to make guesses based on the synthetic data. We baseline this with the protection that completely random data would provide in place of the synthetic.
The baseline is the protection offered by disclosing completely random data. To compute this, we count the total number of value combinations that are possible to guess across all sensitive columns, then invert it so that a higher value means more protection.
The baseline protection is higher if there are more possible values to guess from, because there is a lower probability of randomly getting it right.
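One plausible reading of this, sketched below, is that the baseline equals one minus the chance of a random correct guess across all sensitive-value combinations; the exact formula the metric uses may differ.

```python
import numpy as np
import pandas as pd

def baseline_protection(real_data: pd.DataFrame, sensitive_column_names: list) -> float:
    """Protection offered by guessing completely at random (a sketch, not the library's code)."""
    # Count every combination of sensitive values an attacker could guess from.
    num_combinations = np.prod(
        [real_data[col].nunique(dropna=False) for col in sensitive_column_names]
    )
    # Invert: more combinations -> lower chance of a random correct guess -> more protection.
    return 1 - 1 / num_combinations
```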
The CAP score is the protection offered by disclosing synthetic data. To compute this, we simulate the attacker following the 4-step algorithm defined below:
Pick a row (r) in the real dataset. Note down all the known columns in r.
In the synthetic data, find all the rows that match the known columns of r. Call this set of rows S, also known as the (synthetic) equivalence class of r.
Each row in S will have synthetic values for the sensitive columns. Let each of the values vote to guess the sensitive columns of the real row, r.
The safety score for the row is the fraction of votes that fail to correctly guess all of the sensitive columns. This value is always between 0 and 1.
An illustration of this algorithm is shown below.
We repeat the attack for all rows (r) in the real data. We average the safety scores to form the overall CAP protection score.
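A minimal sketch of these four steps in pandas, for intuition only (it is not the library's implementation, and it skips rows with an empty equivalence class, the case discussed next):

```python
import pandas as pd

def cap_protection(real_data, synthetic_data, known_cols, sensitive_cols):
    """Average per-row safety score under the CAP attack (exact-match sketch)."""
    scores = []
    for _, r in real_data.iterrows():
        # Steps 1-2: find the synthetic equivalence class of r on the known columns.
        mask = (synthetic_data[known_cols] == r[known_cols]).all(axis=1)
        equivalence_class = synthetic_data[mask]
        if equivalence_class.empty:
            continue  # plain CAP skips rows with no matches (see the variants below)

        # Step 3: each matching synthetic row votes on the sensitive columns.
        correct_votes = (equivalence_class[sensitive_cols] == r[sensitive_cols]).all(axis=1)

        # Step 4: safety score = fraction of votes that do NOT guess every sensitive value.
        scores.append(1 - correct_votes.mean())

    # Averaging only over rows that had matches; returning 1.0 otherwise is a simplification.
    return sum(scores) / len(scores) if scores else 1.0
```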
What if there are no rows in the equivalence class? In the regular CAP algorithm, the attacker simply ignores that row and moves on to the next one. Variants of CAP are available based on alternate logic:
Zero CAP: If there are no rows in the equivalence class, then the attacker records a failure (a score of 0) to guess that row.
Generalized CAP: If there are no rows in the equivalence class, the attacker looks for the closest matches instead of exact matches [2] by using the Hamming distance [3]. This ensures that there is always at least 1 row in the equivalence class, which means that there is always at least 1 guess.
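As an illustration of the Generalized CAP fallback (not the library's code), the Hamming distance simply counts how many known-column values differ, and the closest synthetic rows form the equivalence class:

```python
import pandas as pd

def closest_matches(synthetic_data: pd.DataFrame, r: pd.Series, known_cols: list) -> pd.DataFrame:
    """Synthetic rows whose known columns are closest to the real row r by Hamming distance."""
    # Hamming distance: the number of known columns whose value differs from r's value.
    # (Assumes missing values were already encoded as their own category.)
    distances = (synthetic_data[known_cols] != r[known_cols]).sum(axis=1)
    # Keep every row tied for the smallest distance, so the equivalence class is never empty.
    return synthetic_data[distances == distances.min()]
```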
Access this metric from the single_table module and use the compute method.
Parameters
(required) real_data: A pandas.DataFrame containing the real data
(required) synthetic_data: A pandas.DataFrame containing the same columns of synthetic data
(required) known_column_names: A list of strings representing the column names that the attacker already knows
(required) sensitive_column_names: A list of strings representing the column names that the attacker wants to guess
continuous_column_names: A list of column names that represent continuous values. Identify any of the column names (known or sensitive) that need discretization.
(default) None: Assume none of the columns need discretization
num_discrete_bins: For any continuous columns that need discretization, this parameter represents the number of bins to create
(default) 10: Discretize continuous columns into 10 bins
computation: The type of computation we'll use to simulate the attack. Options are:
(default) 'cap': Use the CAP method described in the original paper
'generalized_cap': Use the Generalized CAP method
'zero_cap': Use the Zero CAP method
Alternatively, you can use the compute_breakdown method with the same parameters. This returns the individual scores for CAP and baseline.
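A sketch of that call follows; the key names shown in the example output are assumptions, not the library's documented return format:

```python
breakdown = DisclosureProtection.compute_breakdown(
    real_data=real_data,
    synthetic_data=synthetic_data,
    known_column_names=['age', 'gender'],
    sensitive_column_names=['political_affiliation'],
)
print(breakdown)
# Hypothetical output: {'score': 0.825, 'cap_protection': 0.66, 'baseline_protection': 0.8}
```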
[1] Final Report on the Disclosure Risk Associated with the Synthetic Data Produced by the SYLLS Team
[2] A Baseline for Attribute Disclosure Risk in Synthetic Data
[3] Hamming distance, https://en.wikipedia.org/wiki/Hamming_distance
Illustration of the algorithm: since the real row (r) has an age of 20-29 and a gender of F, the synthetic equivalence class (S) for it has the same. In this case, S has 4 rows. Each has a synthetic value for the sensitive political affiliation.