Privacy Against Inference
Privacy Against Inference describes a set of metrics that calculate the risk of an attacker being able to infer real, sensitive values. We assume that an attacker already possesses a few columns of real data, which they combine with the synthetic data to make educated guesses.
The attacker can use various algorithms to make the guesses. Each is covered by a different metric:
Guessing numerical values: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
Guessing categorical values: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble
Data Compatibility
Categorical/Boolean: Some metrics can be used for discrete, categorical data: CategoricalKNN, CategoricalNB, CategoricalRF, CategoricalEnsemble
Numerical: Some metrics can be used for numerical data: NumericalMLP, NumericalLR, NumericalSVR, NumericalRadiusNearestNeighbor
Choose a metric depending on the type of data that the attacker is guessing. The key_fields and sensitive_fields must all be of the same type. Note that missing values are not supported. Please remove or impute missing values before applying this metric.
Score
(best) 1.0: The real data is 100% safe from the attack. The attacker is not able to correctly guess any of the sensitive values by applying the chosen attack algorithm.
(worst) 0.0: The real data is not at all safe from the attack. The attacker is able to correctly guess every sensitive value by applying the chosen attack algorithm.
How does it work?
We assume that the attacker is in possession of a few columns of the real data (key_fields), as well as the full synthetic dataset, including synthetic sensitive values.
The attacker's goal is to correctly guess the real value of the sensitive information, sensitive_fields. An example is shown below.
To make the guesses, the attacker uses a machine learning algorithm based on the type of data that they want to guess.
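The attack described above can be sketched in plain Python. This is a simplified illustration only: the majority-vote lookup, the function names, the sample columns (city, job, diagnosis) and the `1 - accuracy` score are assumptions chosen to match the 1.0 and 0.0 endpoints described under Score. They are not the algorithms or the scoring formula that the metrics actually use.

```python
from collections import Counter, defaultdict


def train_attacker(synthetic_rows, key_fields, sensitive_field):
    """Learn, from synthetic data only, the most common sensitive value
    for each combination of key values (hypothetical attacker model)."""
    votes = defaultdict(Counter)
    overall = Counter()
    for row in synthetic_rows:
        key = tuple(row[field] for field in key_fields)
        votes[key][row[sensitive_field]] += 1
        overall[row[sensitive_field]] += 1
    # Fallback guess for key combinations never seen in the synthetic data.
    fallback = overall.most_common(1)[0][0]
    lookup = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    return lookup, fallback


def privacy_score(real_rows, synthetic_rows, key_fields, sensitive_field):
    """Return 1 minus the fraction of real sensitive values guessed correctly."""
    lookup, fallback = train_attacker(synthetic_rows, key_fields, sensitive_field)
    correct = 0
    for row in real_rows:
        key = tuple(row[field] for field in key_fields)
        guess = lookup.get(key, fallback)
        correct += guess == row[sensitive_field]
    return 1.0 - correct / len(real_rows)


# Made-up sample data for illustration.
real = [
    {"city": "A", "job": "nurse", "diagnosis": "flu"},
    {"city": "B", "job": "chef", "diagnosis": "cold"},
]
# Synthetic data that copies the real rows exactly.
leaky_synthetic = [dict(row) for row in real]
# Synthetic data whose sensitive values do not match the real ones.
safe_synthetic = [
    {"city": "A", "job": "nurse", "diagnosis": "cold"},
    {"city": "B", "job": "chef", "diagnosis": "flu"},
]

leaky_score = privacy_score(real, leaky_synthetic, ["city", "job"], "diagnosis")
safe_score = privacy_score(real, safe_synthetic, ["city", "job"], "diagnosis")
```

In this toy setup, synthetic data that copies the real rows lets the attacker guess every sensitive value, while synthetic data with different sensitive values lets them guess none.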
Usage
Access this metric from the single_table module and use the compute method.
Parameters
(required) real_data: A pandas.DataFrame containing the real data
(required) synthetic_data: A pandas.DataFrame containing the same columns of synthetic data
(required) key_fields: A list of strings representing the column names that the attacker already knows. These must be compatible with the metric.
(required) sensitive_fields: A list of strings representing the column names that the attacker wants to guess. These must be compatible with the metric.
metadata: A description of the dataset. See Single Table Metadata
**kwargs: Optional keyword args that allow you to customize the model. These args are passed directly into the scikit-learn algorithm.
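A minimal usage sketch follows, assuming the SDMetrics package is installed and that NumericalLR is importable from sdmetrics.single_table as described above. The wrapper function name is hypothetical. Only the four required parameters are passed, and both DataFrames are assumed to contain numerical columns with no missing values.

```python
def score_inference_risk(real_data, synthetic_data, key_fields, sensitive_fields):
    """Return the NumericalLR privacy score for two pandas.DataFrames."""
    # Imported inside the function so this sketch can be loaded even when
    # SDMetrics is not installed (an assumption made for illustration).
    from sdmetrics.single_table import NumericalLR

    return NumericalLR.compute(
        real_data=real_data,
        synthetic_data=synthetic_data,
        key_fields=key_fields,        # columns the attacker already knows
        sensitive_fields=sensitive_fields,  # columns the attacker tries to guess
    )
```

To evaluate a different attack, swap NumericalLR for another metric from the lists above, keeping the key and sensitive fields of a type that the metric supports.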
Metrics
Use these metrics if the key and sensitive fields are numerical, representing continuous data.
NumericalMLP
NumericalLR
NumericalSVR
NumericalRadiusNearestNeighbor
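To give a feel for a numerical attack, here is a rough sketch assuming that the LR in NumericalLR stands for linear regression. It fits a one-variable least-squares line on the synthetic data and measures how far the attacker's guesses land from the real sensitive values. The helper names and sample numbers are made up, and the sketch reports raw guessing errors rather than the score that the metric computes.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def guess_errors(real_keys, real_sensitive, synth_keys, synth_sensitive):
    """Train on synthetic data, then return the absolute error of each guess
    of a real sensitive value made from the real key value."""
    slope, intercept = fit_line(synth_keys, synth_sensitive)
    return [
        abs(slope * key + intercept - actual)
        for key, actual in zip(real_keys, real_sensitive)
    ]


# Made-up data: the synthetic sensitive value is always twice the key value.
synth_keys = [1, 2, 3, 4]
synth_sensitive = [2, 4, 6, 8]
# The first real row follows the same pattern; the second does not.
real_keys = [5, 6]
real_sensitive = [10, 15]

errors = guess_errors(real_keys, real_sensitive, synth_keys, synth_sensitive)
```

Small errors mean the attacker's guesses are close to the real values, so the synthetic data reveals more about the sensitive column.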
FAQs
This metric is in Beta. Be careful when using the metric and interpreting its score.
The score depends heavily on the underlying algorithm used to model the data. If the dataset is not suited for a particular machine learning method, then the predicted values may not be valid.
In a real-world scenario, an attacker may spend more effort building an ML model. These metrics only allow you to select from specific algorithms (LR, MLP, etc.).