Detection: Single Table

Tabular Detection is a set of metrics that measure how difficult it is to tell the real data apart from the synthetic data.

This is done using a machine learning model. There are two different Detection metrics that use different ML algorithms: LogisticDetection and SVCDetection.

Data Compatibility

  • Boolean: These metrics convert boolean columns into 0/1 values. Missing values are replaced by the most commonly occurring value (mode).

  • Categorical: These metrics convert categorical columns into multiple one-hot encoded columns.

  • Datetime: These metrics convert datetime columns into numerical values using the Unix timestamp. They impute missing values using the mean.

  • Numerical: These metrics are designed to work with numerical columns. They impute missing values using the mean.

Note that these metrics should not be used with ID columns that represent the primary or foreign keys of your table.
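The conversions listed above can be sketched with pandas. This is a minimal illustration, not the library's own preprocessing code: the function name to_numeric_features and the column_types mapping are hypothetical, and datetime values are assumed to be timezone-naive.

```python
import pandas as pd


def to_numeric_features(table, column_types):
    """Turn each column into numeric features, following the rules listed above.

    `column_types` is a hypothetical {column name: type} mapping used only here.
    """
    features = []
    for name, kind in column_types.items():
        column = table[name]
        if kind == "boolean":
            # Replace missing values with the mode, then map True/False to 1/0.
            filled = column.fillna(column.mode().iloc[0])
            features.append(filled.astype(bool).astype(int))
        elif kind == "categorical":
            # One 0/1 column per category value (one-hot encoding).
            features.append(pd.get_dummies(column, prefix=name, dtype=int))
        elif kind == "datetime":
            # Seconds since the Unix epoch; missing values become NaN, then the mean.
            elapsed = pd.to_datetime(column) - pd.Timestamp("1970-01-01")
            seconds = elapsed / pd.Timedelta(seconds=1)
            features.append(seconds.fillna(seconds.mean()))
        elif kind == "numerical":
            features.append(column.fillna(column.mean()))
        else:
            raise ValueError(f"Unsupported column type: {kind}")
    return pd.concat(features, axis=1)


table = pd.DataFrame({
    "active": [True, None, True, False],
    "color": ["red", "blue", "red", "green"],
    "joined": pd.to_datetime(["1970-01-01", "1970-01-03", None, "1970-01-02"]),
    "age": [10.0, None, 30.0, 20.0],
})
features = to_numeric_features(table, {
    "active": "boolean",
    "color": "categorical",
    "joined": "datetime",
    "age": "numerical",
})
```

ID columns are left out of column_types on purpose, in line with the note above.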

Score

(highest) 1.0: The machine learning model cannot distinguish the synthetic data from the real data.

(lowest) 0.0: The machine learning model can perfectly distinguish the synthetic data from the real data.

Be careful when interpreting the score. A score of 1 may indicate high quality but it could also be a clue that the synthetic data is leaking privacy (for example, if the synthetic data is copying the rows in the real data).
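One way to investigate a suspiciously high score is to count how many synthetic rows are exact copies of real rows. The helper below is a hypothetical check written for illustration, not part of the metric. It assumes both tables share the same columns.

```python
import pandas as pd


def fraction_of_copied_rows(real, synthetic):
    """Share of synthetic rows that exactly match some row of the real table."""
    # Deduplicate first so the left merge cannot multiply synthetic rows.
    real_rows = real.drop_duplicates()
    matched = synthetic.merge(
        real_rows, how="left", on=list(real.columns), indicator=True
    )
    return float((matched["_merge"] == "both").mean())


real = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
synthetic = pd.DataFrame({"a": [1, 2, 9, 3], "b": ["x", "q", "z", "z"]})
copied_share = fraction_of_copied_rows(real, synthetic)
```

A large share of copied rows suggests that a high detection score reflects memorization rather than quality.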

How does it work?

All tabular detection metrics run through the following steps:

  1. Create a single, augmented table that has all the rows of real data and all the rows of synthetic data. Add an extra column to keep track of whether each original row is real or synthetic.

  2. Split the augmented data to create training and validation sets.

  3. Choose a machine learning model based on the metric used (see below). Train the model on the training split. The model will predict whether each row is real or synthetic (i.e. predict the extra column created in step #1).

  4. Validate the model on the validation set.

  5. Repeat steps #2-4 multiple times.

The final score is based on the average ROC AUC score [1] across all the cross-validation splits produced by repeating steps #2-4.
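The steps above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the library's implementation: the function name detection_score, the number of splits, the feature scaling, and the mapping from average AUC to a 0-1 score are all choices made for this sketch. The mapping is picked so that an AUC of 0.5 (chance) gives 1.0 and an AUC of 1.0 gives 0.0, which matches the Score section. The tables are assumed to be fully numeric already.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def detection_score(real, synthetic, n_splits=3, seed=0):
    # Step 1: stack both tables and keep a label marking the synthetic rows.
    features = pd.concat([real, synthetic], ignore_index=True).to_numpy(dtype=float)
    is_synthetic = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])

    aucs = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Steps 2-5: split, train, validate, and repeat for every fold.
    for train_idx, valid_idx in folds.split(features, is_synthetic):
        model = make_pipeline(StandardScaler(), LogisticRegression())
        model.fit(features[train_idx], is_synthetic[train_idx])
        predicted = model.predict_proba(features[valid_idx])[:, 1]
        aucs.append(roc_auc_score(is_synthetic[valid_idx], predicted))

    # Assumed mapping: AUC 0.5 -> score 1.0, AUC 1.0 -> score 0.0.
    mean_auc = float(np.mean(aucs))
    return 1.0 - (max(0.5, mean_auc) * 2.0 - 1.0)


rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(0.0, 1.0, size=(200, 2)), columns=["a", "b"])
shifted = pd.DataFrame(rng.normal(10.0, 1.0, size=(200, 2)), columns=["a", "b"])

easy_to_detect = detection_score(real, shifted)    # clearly different data
copied = detection_score(real, real.copy())        # synthetic copies real
```

The second call illustrates the caveat from the Score section: a table that simply copies the real rows is also hard to detect.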

Metrics

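The metric you choose determines which ML algorithm is trained and validated: LogisticDetection uses scikit-learn's LogisticRegression classifier [2], and SVCDetection uses scikit-learn's SVC (support vector classifier) [3].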

Usage

Access this metric from the single_table module and use the compute method.

from sdmetrics.single_table import LogisticDetection

LogisticDetection.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    metadata=my_table_metadata_dict
)

Parameters

  • (required) real_data: A pandas.DataFrame containing all the compatible columns of the real data.

  • (required) synthetic_data: A pandas.DataFrame containing all the compatible columns of the synthetic data.

  • metadata: A description of the dataset. See Single Table Metadata

FAQs

This metric is in Beta. Be careful when using the metric and interpreting its score.

  • The score depends heavily on the underlying algorithm used to model the data. If the dataset is not suited to a particular machine learning method, then the detection results may not be valid.

  • There are multiple interpretations for this metric. (See the Score section above.)
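The dependence on the underlying algorithm can be seen in a small experiment. The sketch below builds a toy dataset (an assumption made purely for illustration) in which real rows cluster at the origin and synthetic rows lie on a ring around them. It then compares the cross-validated ROC AUC of a linear classifier with that of a support vector classifier. It calls scikit-learn directly instead of the detection metrics, and the helper name mean_auc is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200

# Real rows: a tight cluster at the origin.
real = rng.normal(0.0, 0.3, size=(n, 2))
# Synthetic rows: a noisy ring of radius 3 around that cluster.
angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
ring = 3.0 * np.column_stack([np.cos(angles), np.sin(angles)])
synthetic = ring + rng.normal(0.0, 0.1, size=(n, 2))

features = np.vstack([real, synthetic])
is_synthetic = np.concatenate([np.zeros(n), np.ones(n)])


def mean_auc(estimator):
    """Average ROC AUC of `estimator` over shuffled, stratified folds."""
    model = make_pipeline(StandardScaler(), estimator)
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, features, is_synthetic, cv=folds, scoring="roc_auc")
    return float(scores.mean())


logistic_auc = mean_auc(LogisticRegression())  # linear boundary: near chance
svc_auc = mean_auc(SVC())                      # RBF kernel: separates the ring
```

On data like this, a linear model would report the synthetic data as hard to detect, while a kernel-based model detects it easily, so the two metrics can disagree sharply.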

References

[1] https://en.wikipedia.org/wiki/Receiver_operating_characteristic

[2] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

[3] https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
