DisclosureProtection
Last updated
The DisclosureProtection metric measures the risk associated with disclosing (that is, broadly sharing) the synthetic data. It is useful for determining whether the synthetic data is leaking patterns that pertain to sensitive information.
This metric simulates the attack scenario using your real and synthetic data. It describes how much your synthetic data protects against the risk of disclosure as compared to a baseline of completely random data.
Categorical: This metric is meant for discrete, categorical data
Boolean: This metric works on booleans because it is a type of categorical data
Numerical: This metric works on numerical data by discretizing it into categories
Datetime: This metric works on datetime data by discretizing it into categories
Missing values are supported. This metric considers missing values as a single, separate category value.
(best) 1.0: The synthetic data provides strong disclosure protection. Sharing the synthetic data provides no more risk than sharing completely random values.
(worst) 0.0: The synthetic data does not provide disclosure protection. Sharing the synthetic data divulges patterns that make it easy to guess sensitive attributes.
Scores between 0.0 and 1.0 indicate the relative risk of disclosure. For example, a score of 0.825 indicates that the synthetic data has 82.5% of the protection that random data would provide.
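The relative score can be sketched as a ratio: assuming the reported value is the CAP protection divided by the baseline protection, capped at 1.0 (the function name below is illustrative, not part of the library's API):

```python
def disclosure_protection_score(cap_protection, baseline_protection):
    """Relative protection: how much of the baseline (random-data)
    protection the synthetic data achieves. Illustrative sketch only."""
    if baseline_protection == 0:
        return 0.0
    # Cap at 1.0: synthetic data cannot be "safer" than random data.
    return min(cap_protection / baseline_protection, 1.0)

# A CAP protection of 0.66 against a baseline of 0.80 yields ~0.825,
# i.e. the synthetic data has 82.5% of the protection of random data.
print(disclosure_protection_score(0.66, 0.80))
```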
To simulate the attack scenario, we use the real data. We pretend the attacker knows a few columns of the real data (known columns) and wants to guess other columns (sensitive columns). The attacker also has a full synthetic dataset.
To compute this metric, we assume the attacker uses an algorithm called CAP [1] to make guesses based on the synthetic data. We baseline this with the protection that completely random data would provide in place of the synthetic.
The baseline is the protection offered by disclosing completely random data. To compute it, we count the total number of value combinations that are possible to guess across all sensitive columns. We then invert the score so that a higher value means more protection.
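As a minimal sketch, assuming the baseline is the probability that a uniformly random guess across every sensitive column's categories is wrong (function and variable names are illustrative):

```python
from math import prod

def baseline_protection(categories_per_sensitive_column):
    """Probability that a completely random guess across all sensitive
    columns is incorrect. Higher means more protection. Sketch only."""
    combinations = prod(categories_per_sensitive_column)
    # Invert: the chance that a random guess is correct is 1/combinations.
    return 1.0 - 1.0 / combinations

# One sensitive column with 4 categories: random guessing is wrong 75% of the time.
print(baseline_protection([4]))
```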
The CAP score is the protection offered by disclosing synthetic data. To compute this, we simulate the attacker following the 4-step algorithm defined below:
Pick a row (r) in the real dataset. Note down all the known columns in r.
In the synthetic data, find all the rows that match the known columns of r. Call this set of rows S, also known as the (synthetic) equivalence class of r.
Each row in S will have synthetic values for the sensitive columns. Let each of the values vote to guess the sensitive columns of the real row, r.
The safety score for the row is the fraction of votes that fail to guess all of the sensitive columns correctly. This value is always between 0 and 1.
An illustration of this algorithm is shown below.
We repeat the attack for all rows (r) in the real data. We average the safety scores to form the overall CAP protection score.
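The four steps above can be sketched in plain Python. This is a hedged, simplified implementation: it follows the steps literally and skips the variant behaviors for empty equivalence classes that distinguish CAP, Zero CAP, and Generalized CAP. All data and names are illustrative.

```python
def cap_protection(real_rows, synthetic_rows, known, sensitive):
    """Average per-row safety score: the fraction of equivalence-class
    votes that fail to guess all sensitive values correctly."""
    safety_scores = []
    for r in real_rows:
        # Step 1: note the known-column values of the real row.
        key = tuple(r[c] for c in known)
        # Step 2: the synthetic equivalence class S of r.
        S = [s for s in synthetic_rows if tuple(s[c] for c in known) == key]
        if not S:
            continue  # simplification: the CAP variants handle empty classes differently
        # Steps 3-4: each synthetic row votes; count votes that miss
        # at least one sensitive column.
        wrong = sum(1 for s in S if any(s[c] != r[c] for c in sensitive))
        safety_scores.append(wrong / len(S))
    return sum(safety_scores) / len(safety_scores) if safety_scores else 0.0

real = [{'age': '20-29', 'gender': 'F', 'party': 'A'}]
synthetic = [
    {'age': '20-29', 'gender': 'F', 'party': 'A'},
    {'age': '20-29', 'gender': 'F', 'party': 'B'},
]
# S has 2 rows; 1 of the 2 votes guesses the sensitive value wrong.
print(cap_protection(real, synthetic, ['age', 'gender'], ['party']))  # 0.5
```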
Access this metric from the single_table module and use the compute method.
Parameters
(required) real_data
: A pandas.DataFrame object containing the real data
(required) synthetic_data
: A pandas.DataFrame object containing the synthetic data
(required) known_column_names
: A list of strings representing the column names that the attacker already knows
(required) sensitive_column_names
: A list of strings representing the column names that the attacker wants to guess
continuous_column_names
: A list of column names that represent continuous values. Identify any of the column names (known or sensitive) that need discretization.
(default) None
: Assume none of the columns need discretization
num_discrete_bins
: For any continuous columns that need discretization, this parameter represents the number of bins to create
(default) 10
: Discretize continuous columns into 10 bins
computation
: The type of computation we'll use to simulate the attack. Options are:
(default) 'cap'
: Use the CAP method described in the original paper
'generalized_cap'
: Use the Generalized CAP method
'zero_cap'
: Use the Zero CAP method
Alternatively, you can use the compute_breakdown
method with the same parameters. This returns the individual scores for CAP and baseline.
To speed up the computation, we recommend using the metric.
The DisclosureProtection metric provides everything that the CategoricalCAP metric used to provide, and more. If you are a legacy user, we strongly recommend moving to the DisclosureProtection metric instead. For a smooth transition, we have kept the CategoricalCAP metric in our library as a temporary measure.
[1]
Illustration: if the real row (r) has an age of 20-29 and a gender of F, its synthetic equivalence class (S) contains the synthetic rows with those same known values. In this case, S has 4 rows, each with a synthetic value for the sensitive political affiliation column.