
DisclosureProtectionEstimate



This metric provides an estimate of the overall DisclosureProtection score by subsampling your data and averaging across several smaller iterations. Use this if your data is too large for the regular DisclosureProtection metric.

Data Compatibility

  • Categorical: This metric is meant for discrete, categorical data

  • Boolean: This metric works on booleans because it is a type of categorical data

  • Numerical: This metric works on numerical data by discretizing it into categories

  • Datetime: This metric works on datetime data by discretizing it into categories

Missing values are supported. This metric considers missing values as a single, separate category value.
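As an illustration of the discretization behavior described above, here is a sketch of binning a numerical column into categories while treating missing values as their own category. This uses plain pandas for demonstration and is not the SDMetrics internal implementation (SDMetrics handles this automatically via the `continuous_column_names` and `num_discrete_bins` parameters).

```python
import numpy as np
import pandas as pd

# A toy numerical column with one missing value.
ages = pd.Series([23, 35, 47, 59, np.nan, 71])

# Discretize into equal-width bins. SDMetrics defaults to 10 bins;
# 3 bins here keeps the output readable.
binned = pd.cut(ages, bins=3)

# Treat missing values as a single, separate category, mirroring the
# behavior described above.
binned = binned.cat.add_categories(["missing"]).fillna("missing")
print(binned.value_counts())
```

After this transformation every value, including the former NaN, belongs to exactly one discrete category, so categorical comparison logic can be applied uniformly.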

Score

(best) 1.0: The synthetic data is estimated to provide strong disclosure protection. Sharing the synthetic data provides no more risk than sharing completely random values.

(worst) 0.0: The synthetic data is estimated to not provide disclosure protection. Sharing the synthetic data divulges patterns that make it easy to guess sensitive attributes.

Scores between 0.0 and 1.0 indicate the relative risk of disclosure. For example, a score of 0.825 indicates that the synthetic data has 82.5% of the protection that random data would provide.

How does it work?

  1. Take a random subsample from each of the overall real and synthetic datasets.

  2. Compute the score on the subsamples. This runs faster because the subsamples are smaller than the full datasets.

  3. Repeat steps 1 and 2 for many iterations, sampling with replacement between each iteration.

  4. Report the average score as the final score.
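The subsample-and-average loop described above can be sketched as follows. This is an illustration, not the SDMetrics implementation: `score_fn` is a hypothetical stand-in for the full DisclosureProtection computation.

```python
import numpy as np

def estimate_score(real, synthetic, score_fn,
                   num_rows_subsample=1000, num_iterations=10, seed=0):
    """Estimate a score by averaging over random subsamples.

    `score_fn` stands in for the full DisclosureProtection computation.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(num_iterations):
        # Step 1: draw a random subsample (with replacement) from each dataset.
        real_idx = rng.choice(len(real), size=num_rows_subsample, replace=True)
        synth_idx = rng.choice(len(synthetic), size=num_rows_subsample, replace=True)
        # Step 2: score the smaller subsamples -- faster than the full data.
        scores.append(score_fn(real[real_idx], synthetic[synth_idx]))
    # Steps 3-4: average across iterations for the final estimate.
    return float(np.mean(scores))
```

Because each iteration only touches `num_rows_subsample` rows, the cost per iteration stays bounded no matter how large the full datasets are.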

Usage

Access this metric from the single_table module and use the compute method.

from sdmetrics.single_table import DisclosureProtectionEstimate

score = DisclosureProtectionEstimate.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_column_names=['age_bracket', 'gender'],
    sensitive_column_names=['political_affiliation'],
    num_rows_subsample=2500,
    num_iterations=100,
    verbose=True
)
Estimating Disclosure Protection (Score=0.8250): 100%|█████████████████| 100/100

Parameters

This metric has the same parameters as the regular DisclosureProtection metric, plus additional parameters that control the subsampling and iteration.

  • (required) real_data: A pandas.DataFrame object containing the real data

  • (required) synthetic_data: A pandas.DataFrame object containing synthetic data

  • (required) known_column_names: A list of strings representing the column names that the attacker already knows

  • (required) sensitive_column_names: A list of strings representing the column names that the attacker wants to guess

  • continuous_column_names: A list of column names that represent continuous values. Identify any of the column names (known or sensitive) that need discretization.

    • (default) None: Assume none of the columns need discretization

  • num_discrete_bins: For any continuous columns that need discretization, this parameter represents the number of bins to create

    • (default) 10: Discretize continuous columns into 10 bins

  • computation: The type of computation we'll use to simulate the attack. Options are:

    • (default) 'cap': Use the CAP method described in the original paper

    • 'generalized_cap': Use the Generalized CAP method

    • 'zero_cap': Use the Zero CAP method

  • num_rows_subsample: An integer describing the number of rows to subsample in each of the real and synthetic datasets

    • (default) 1000: Subsample 1000 rows in both the real and synthetic data

    • <int>: Subsample the number of rows provided

  • num_iterations: The number of iterations to perform before determining the final score

    • (default) 10: Perform 10 iterations

    • <int>: Perform the number of iterations provided

  • verbose: A boolean describing whether to show the progress

    • (default) True: Show the progress of each iteration and the updating score

    • False: Do not show any score

Alternatively, you can use the compute_breakdown method with the same parameters. This returns the individual scores for CAP and baseline.

from sdmetrics.single_table import DisclosureProtectionEstimate

score = DisclosureProtectionEstimate.compute_breakdown(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_column_names=['age', 'gender'],
    sensitive_column_names=['political_affiliation'],
    continuous_column_names=['age'],
    num_rows_subsample=2500,
    num_iterations=100,
    verbose=True
)
{
    'score': 0.825,
    'cap_protection_estimate': 0.55,
    'baseline_protection': 0.66666666
}

In this example, the average data safety across all rows is 55% for the synthetic data. As a baseline, random data would have a safety of 66.67% (because there is a ⅓ chance of guessing the value correctly). The overall DisclosureProtection score is then 0.55 / 0.6667 ≈ 0.825.
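The arithmetic behind the breakdown values can be verified directly. Note that expressing the final score as the ratio of the two breakdown values is inferred from this example; consult the DisclosureProtection documentation for the exact formula.

```python
# Breakdown values from the compute_breakdown output above.
cap_protection_estimate = 0.55   # average data safety of the synthetic data
baseline_protection = 2 / 3      # random guessing: 1/3 chance of being correct

# The overall score compares the synthetic data's protection to the
# protection that purely random data would provide (ratio capped at 1.0).
score = min(cap_protection_estimate / baseline_protection, 1.0)
print(round(score, 3))
```

A ratio of 1.0 would mean the synthetic data is as safe to share as random values; lower ratios indicate proportionally less protection.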