Privacy Against Inference

Privacy Against Inference describes a set of metrics that calculate the risk of an attacker being able to infer real, sensitive values. We assume that an attacker already possesses a few columns of real data; they will combine them with the synthetic data to make educated guesses.

The attacker can use various algorithms to make the guesses. Each is covered by a different metric:

  • Guessing numerical values: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor

  • Guessing categorical values: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble

Data Compatibility

  • Categorical/Boolean: Some metrics can be used for discrete, categorical data: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble

  • Numerical: Some metrics can be used for numerical data: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor

Choose a metric depending on the type of data that the attacker is guessing. The key_fields and sensitive_fields must all be of the same type. Note that missing values are not supported. Please remove or impute missing values before applying this metric.
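As a minimal sketch of the clean-up step described above, the snippet below drops every row that has a missing value in a key or sensitive field. It uses plain Python dictionaries to stay dependency-free; the helper names (is_missing, drop_incomplete_rows) and the sample rows are hypothetical and are not part of SDMetrics. With a pandas.DataFrame, the same effect is usually achieved by calling dropna on those columns.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete_rows(rows, fields):
    """Keep only the rows where every listed field has a value."""
    return [
        row for row in rows
        if not any(is_missing(row.get(field)) for field in fields)
    ]

# Hypothetical sample data: two of the three rows have a missing value.
rows = [
    {'age_bracket': '18-25', 'gender': 'F', 'political_affiliation': 'A'},
    {'age_bracket': None, 'gender': 'M', 'political_affiliation': 'B'},
    {'age_bracket': '26-40', 'gender': 'M', 'political_affiliation': float('nan')},
]
key_fields = ['age_bracket', 'gender']
sensitive_fields = ['political_affiliation']

clean_rows = drop_incomplete_rows(rows, key_fields + sensitive_fields)
print(len(clean_rows))
```

Dropping rows is only one option; imputing a value instead keeps the row count unchanged but may alter the distribution that the attacker learns from.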

Score

(best) 1.0: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the chosen attack algorithm.

(worst) 0.0: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the chosen attack algorithm.

How does it work?

We assume that the attacker is in possession of

  • a few columns of the real data (key_fields), as well as

  • the full synthetic dataset, including synthetic sensitive values.

The attacker's goal is to correctly guess the real value of the sensitive information, sensitive_fields. An example is shown below.

To make the guesses, the attacker uses a machine learning algorithm based on the type of data that they want to guess.
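The attack described above can be sketched with a toy attacker. This is an illustration only, not the SDMetrics implementation: instead of one of the machine learning algorithms named earlier, the attacker here simply memorizes the most common sensitive value for each key combination in the synthetic data, and the score is assumed to be one minus the fraction of correct guesses. The function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def fit_attacker(synthetic_rows, key_fields, sensitive_field):
    """Learn the most common sensitive value for each key combination."""
    counts = defaultdict(Counter)
    overall = Counter()
    for row in synthetic_rows:
        key = tuple(row[field] for field in key_fields)
        counts[key][row[sensitive_field]] += 1
        overall[row[sensitive_field]] += 1
    # Fallback guess for key combinations never seen in the synthetic data.
    fallback = overall.most_common(1)[0][0]
    lookup = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return lookup, fallback

def privacy_score(real_rows, synthetic_rows, key_fields, sensitive_field):
    """Assumed scoring: 1.0 minus the fraction of correct guesses."""
    lookup, fallback = fit_attacker(synthetic_rows, key_fields, sensitive_field)
    correct = 0
    for row in real_rows:
        key = tuple(row[field] for field in key_fields)
        if lookup.get(key, fallback) == row[sensitive_field]:
            correct += 1
    return 1.0 - correct / len(real_rows)

def make_rows(records):
    """Build row dictionaries from (age_bracket, gender, affiliation) tuples."""
    names = ('age_bracket', 'gender', 'political_affiliation')
    return [dict(zip(names, record)) for record in records]

synthetic_rows = make_rows([
    ('18-25', 'F', 'A'), ('18-25', 'F', 'A'), ('26-40', 'M', 'B'),
    ('26-40', 'F', 'B'), ('41-60', 'F', 'A'),
])
real_rows = make_rows([
    ('18-25', 'F', 'A'),  # guessed correctly
    ('26-40', 'M', 'A'),  # guessed 'B', incorrect
    ('26-40', 'F', 'B'),  # guessed correctly
    ('41-60', 'M', 'B'),  # unseen key, fallback 'A', incorrect
])

score = privacy_score(
    real_rows, synthetic_rows,
    key_fields=['age_bracket', 'gender'],
    sensitive_field='political_affiliation',
)
print(score)  # 0.5
```

In this toy setup the attacker guesses two of four sensitive values, giving a score halfway between the best and worst cases. A stronger or weaker attack model would produce a different score on the same data.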

Usage

Access this metric from the single_table module and use the compute method.

from sdmetrics.single_table import CategoricalKNN

CategoricalKNN.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'gender'],
    sensitive_fields=['political_affiliation']
)

Parameters

  • (required) real_data: A pandas.DataFrame containing the real data

  • (required) synthetic_data: A pandas.DataFrame containing the same columns of synthetic data

  • (required) key_fields: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.

  • (required) sensitive_fields: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.

  • **kwargs: Optional keyword args that allow you to customize the model. These args are passed directly into the scikit-learn algorithm.

Metrics

Use these metrics if the key and sensitive fields are numerical, representing continuous data.

  • NumericalMLP

  • NumericalLR

  • NumericalSVR

  • NumericalRadiusNearestNeighbor

Use these metrics if the key and sensitive fields are categorical, representing discrete data.

  • CategoricalKNN

  • CategoricalNB

  • CategoricalRF

  • CategoricalEnsemble
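To make the choice between the two families concrete, here is a small sketch that inspects the key and sensitive columns and lists the candidate metric names. The helpers (column_family, candidate_metrics) and the sample columns are hypothetical, and the type check is a simple heuristic based on Python value types; integer-coded categories, for instance, would need to be handled separately.

```python
METRICS_BY_TYPE = {
    'numerical': [
        'NumericalMLP', 'NumericalLR', 'NumericalSVR',
        'NumericalRadiusNearestNeighbor',
    ],
    'categorical': [
        'CategoricalKNN', 'CategoricalNB', 'CategoricalRF',
        'CategoricalEnsemble',
    ],
}

def column_family(values):
    """Classify a column as 'numerical' or 'categorical' from its values."""
    # bool is a subclass of int in Python, so booleans are excluded
    # explicitly and end up in the categorical family.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return 'numerical'
    return 'categorical'

def candidate_metrics(columns, key_fields, sensitive_fields):
    """Return metric names that match the shared type of all fields."""
    families = {column_family(columns[name]) for name in key_fields + sensitive_fields}
    if len(families) != 1:
        raise ValueError('key_fields and sensitive_fields must all be of the same type')
    return METRICS_BY_TYPE[families.pop()]

# Hypothetical columns, stored as name -> list of values.
columns = {
    'age_bracket': ['18-25', '26-40'],
    'gender': ['F', 'M'],
    'political_affiliation': ['A', 'B'],
    'income': [52000.0, 61000.0],
}

print(candidate_metrics(columns, ['age_bracket', 'gender'], ['political_affiliation']))
```

Mixing families, for example using income as a key field while guessing political_affiliation, raises a ValueError in this sketch, mirroring the requirement that all fields share one type.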

FAQs

This metric is in Beta. Be careful when using the metric and interpreting its score.

  • The score depends heavily on the underlying algorithm used to model the data. If the dataset is not suited to a particular machine learning method, then the predicted values may not be valid.

  • In a real-world scenario, an attacker may spend more effort in building an ML model. These metrics only allow you to select from specific algorithms (LR, MLP, etc.).

What other metrics can I use to measure privacy?

The CategoricalCAP metric also measures privacy using a similar methodology. In that metric, the attacker uses an inference algorithm called Correct Attribution Prediction (CAP). We recommend using CategoricalCAP, as the CAP algorithm has been well studied for the purposes of evaluating synthetic data. This algorithm is also closely related to the privacy concepts of k-anonymity and l-diversity.

In the usage example above, we assume the key_fields are a person's age bracket and gender. Meanwhile, the sensitive_fields are the person's political affiliation; this is what the attacker wants to guess.

metadata: A description of the dataset. See Single Table Metadata.

Last updated 2 years ago