DisclosureProtectionEstimate
This metric estimates the overall DisclosureProtection score by subsampling your data and averaging the results across several smaller iterations. Use this if your data is too large for the regular DisclosureProtection metric.
Categorical: This metric is meant for discrete, categorical data
Boolean: This metric works on boolean data because it is a type of categorical data
Numerical: This metric works on numerical data by discretizing it into categories
Datetime: This metric works on datetime data by discretizing it into categories
Missing values are supported. This metric considers missing values as a single, separate category value.
(best) 1.0: The synthetic data is estimated to provide strong disclosure protection. Sharing the synthetic data provides no more risk than sharing completely random values.
(worst) 0.0: The synthetic data is estimated to provide no disclosure protection. Sharing the synthetic data divulges patterns that make it easy to guess sensitive attributes.
Scores between 0.0 and 1.0 indicate the relative level of protection. For example, a score of 0.825 indicates that the synthetic data has 82.5% of the protection that completely random data would provide.
This metric is designed to estimate the value of DisclosureProtection using the following algorithm:
1. Take a random subsample from the overall real and synthetic datasets.
2. Compute the DisclosureProtection score on the subsamples. This runs faster because the subsamples are smaller than the full datasets.
3. Repeat steps 1 and 2 for many iterations, sampling with replacement between each iteration.
4. Report the average score across all iterations as the final score.
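The loop below is a minimal sketch of this procedure, assuming the single-shot DisclosureProtection metric from the same single_table module. The library's built-in implementation may differ in details such as sampling and progress reporting.

```python
import pandas as pd

from sdmetrics.single_table import DisclosureProtection


def estimate_disclosure_protection(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    known_column_names: list,
    sensitive_column_names: list,
    num_rows_subsample: int = 1000,
    num_iterations: int = 10,
) -> float:
    """Sketch of the subsample-and-average estimation loop."""
    scores = []
    for _ in range(num_iterations):
        # Step 1: draw a fresh random subsample of each dataset
        # (with replacement, so iterations are independent).
        real_sample = real_data.sample(num_rows_subsample, replace=True)
        synthetic_sample = synthetic_data.sample(num_rows_subsample, replace=True)

        # Step 2: score the subsamples; smaller inputs run faster.
        score = DisclosureProtection.compute(
            real_data=real_sample,
            synthetic_data=synthetic_sample,
            known_column_names=known_column_names,
            sensitive_column_names=sensitive_column_names,
        )
        scores.append(score)

    # Steps 3-4: average the per-iteration scores into the final estimate.
    return sum(scores) / len(scores)
```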
Access this metric from the single_table module and use the compute method.
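For example, the snippet below scores two pandas DataFrames. The table and column names ('age', 'gender', 'income') are hypothetical placeholders; substitute your own.

```python
from sdmetrics.single_table import DisclosureProtectionEstimate

# real_table and synthetic_table are pandas DataFrames with the same columns.
# The column names below are hypothetical placeholders.
score = DisclosureProtectionEstimate.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_column_names=['age', 'gender'],
    sensitive_column_names=['income'],
)
```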
Parameters
This metric has the same parameters as DisclosureProtection, plus a few additional parameters that control the subsampling.
(required) real_data
: A pandas.DataFrame containing the real data
(required) synthetic_data
: A pandas.DataFrame containing the synthetic data, with the same columns as the real data
(required) known_column_names
: A list of strings representing the column names that the attacker already knows
(required) sensitive_column_names
: A list of strings representing the column names that the attacker wants to guess
continuous_column_names
: A list of column names that represent continuous values. Identify any of the column names (known or sensitive) that need discretization.
(default) None
: Assume none of the columns need discretization
num_discrete_bins
: For any continuous columns that need discretization, this parameter represents the number of bins to create
(default) 10
: Discretize continuous columns into 10 bins
computation
: The type of computation we'll use to simulate the attack. Options are:
(default) 'cap'
: Use the CAP method described in the original paper
'generalized_cap'
: Use the Generalized CAP method
'zero_cap'
: Use the Zero CAP method
num_rows_subsample
: An integer describing the number of rows to subsample in each of the real and synthetic datasets
(default) 1000
: Subsample 1000 rows in both the real and synthetic data
<int>
: Subsample the number of rows provided
num_iterations
: The number of iterations to perform before determining the final score
(default) 10
: Perform 10 iterations
<int>
: Perform the number of iterations provided
verbose
: A boolean describing whether to show the progress
(default) True
: Show the progress of each iteration and the updating score
False
: Do not show any progress or intermediate scores
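Putting the optional parameters together, a fuller call might look like the following. The values shown are the defaults, and the column names remain hypothetical; 'age' and 'income' are assumed to be continuous columns that need discretization.

```python
from sdmetrics.single_table import DisclosureProtectionEstimate

score = DisclosureProtectionEstimate.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_column_names=['age', 'gender'],
    sensitive_column_names=['income'],
    continuous_column_names=['age', 'income'],  # columns to discretize
    num_discrete_bins=10,        # bins per continuous column
    computation='cap',           # or 'generalized_cap' / 'zero_cap'
    num_rows_subsample=1000,     # rows sampled per iteration
    num_iterations=10,           # iterations to average over
    verbose=True,                # print progress for each iteration
)
```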
Alternatively, you can use the compute_breakdown method with the same parameters. This returns the individual scores for CAP and baseline.
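For instance, the call below mirrors the earlier example. The exact keys of the returned dictionary are determined by the library, so treat any printed names as illustrative.

```python
from sdmetrics.single_table import DisclosureProtectionEstimate

breakdown = DisclosureProtectionEstimate.compute_breakdown(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_column_names=['age', 'gender'],
    sensitive_column_names=['income'],
)

# Inspect the component scores (e.g. the CAP score and the baseline).
print(breakdown)
```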