
Detection: Sequential

LSTMDetection calculates how difficult it is to tell the real, sequential data apart from the synthetic data. This is done using a neural network.

Data Compatibility

This metric is meant for sequential data that represents an ordered sequence of rows, such as a time series. It is optimized for cases where there are multiple sequences in each of the real and synthetic datasets, and the synthetic sequences do not have a clear 1-1 mapping with the real sequences.
  • Boolean: This metric converts boolean columns into 0/1 values. Missing values are replaced by the most commonly occurring value (mode).
  • Categorical: This metric converts categorical columns into multiple, one hot encoded columns.
  • Datetime: This metric converts datetime columns into numerical values using the Unix timestamp. Missing values are imputed using the mean.
  • Numerical: This metric is designed to work with numerical columns. Missing values are imputed using the mean.
  • Entity columns: It's important to include entity columns so the metric can tell which rows belong to which sequences.
This metric should not be used with any other columns, such as primary keys or anonymized columns. A minimal sketch of these column transformations is shown below.
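The snippet below is a minimal sketch of these transformations using pandas. The column names ('is_active', 'category', 'event_time', 'amount') are hypothetical, and this is not the metric's internal implementation; it only illustrates the conversions described above.

import pandas as pd

def preprocess(df):
    # Illustrative only: apply the column conversions described above
    out = pd.DataFrame(index=df.index)

    # Boolean: impute missing values with the mode, then map to 0/1
    is_active = df['is_active'].fillna(df['is_active'].mode()[0])
    out['is_active'] = is_active.astype(int)

    # Categorical: expand into multiple one hot encoded columns
    out = out.join(pd.get_dummies(df['category'], prefix='category', dtype=int))

    # Datetime: convert to Unix timestamps (seconds), impute with the mean
    seconds = (pd.to_datetime(df['event_time']) - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)
    out['event_time'] = seconds.fillna(seconds.mean())

    # Numerical: impute missing values with the mean
    out['amount'] = df['amount'].fillna(df['amount'].mean())

    return out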

Score

(highest) 1.0: The machine learning model cannot tell apart any of the real and synthetic rows
(lowest) 0.0: The machine learning model can correctly identify all the real and synthetic rows
There are multiple interpretations of the score. A high score can indicate high synthetic data quality as well as low privacy. A low score can indicate low synthetic data quality as well as high privacy.

How does it work?

This detection metric runs through the following steps:
  1. Create a single, augmented table that has all the rows of real data and all the rows of synthetic data. Add an extra column to keep track of whether each original row is real or synthetic.
  2. Split the augmented data to create training and validation sets.
  3. Create a long short-term memory (LSTM) neural network [1]. Train it on the training split. The neural network will predict whether each row is real or synthetic (i.e. predict the extra column we created in step #1).
  4. Validate the model on the validation set.
  5. Repeat steps #2-4 multiple times.
The final score is: 1 - average ROC AUC score [1] across all the cross validation splits.
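The following is a conceptual sketch of these steps, assuming the rows have already been preprocessed into numerical features. To keep the example short, a scikit-learn LogisticRegression stands in for the LSTM; the actual metric trains an LSTM over the sequences identified by the entity columns.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def detection_score(real_features, synthetic_features, n_splits=3):
    # Step 1: one augmented table plus an extra real/synthetic indicator column
    augmented = pd.concat([real_features, synthetic_features], ignore_index=True)
    labels = np.concatenate([np.zeros(len(real_features)), np.ones(len(synthetic_features))])

    # Steps 2-5: repeated train/validation splits, each scored with ROC AUC
    aucs = []
    for train_idx, valid_idx in StratifiedKFold(n_splits=n_splits).split(augmented, labels):
        classifier = LogisticRegression(max_iter=1000)
        classifier.fit(augmented.iloc[train_idx], labels[train_idx])
        probabilities = classifier.predict_proba(augmented.iloc[valid_idx])[:, 1]
        aucs.append(roc_auc_score(labels[valid_idx], probabilities))

    # Final score: 1 - average ROC AUC across all splits
    return 1 - np.mean(aucs)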

Usage

Access this metric from the timeseries module and use the compute method.
from sdmetrics.timeseries import LSTMDetection

LSTMDetection.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    metadata=metadata
)
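For illustration, real_table and synthetic_table could be built as follows. The column names ('store_id', 'date', 'sales') are hypothetical; 'store_id' acts as the entity column that identifies each sequence, and metadata should describe these columns as explained in Sequential Metadata.

import pandas as pd

# Hypothetical real data: two sequences ('A' and 'B'), each an ordered series of rows
real_table = pd.DataFrame({
    'store_id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'] * 2),
    'sales': [10.0, 12.5, 11.0, 30.0, 28.5, 31.0],
})

# Hypothetical synthetic data: sequences that do not map 1-1 to the real ones
synthetic_table = pd.DataFrame({
    'store_id': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'] * 2),
    'sales': [11.0, 11.5, 12.0, 29.0, 30.5, 29.5],
})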
Parameters
  • (required) real_data: A pandas.DataFrame containing all the compatible columns of the real data
  • (required) synthetic_data: A pandas.DataFrame containing all the compatible columns of the synthetic data
  • metadata: A description of the dataset. See Sequential Metadata.
  • entity_columns: A list of strings that describe the names of entity columns. Entity columns are used to identify which rows belong to which sequences.
This metric requires entity columns. These can either be passed in through the metadata or through the entity_columns parameter, as shown below.
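For example, if the metadata does not already list the entity columns, they can be passed explicitly. This sketch continues the hypothetical tables above, where 'store_id' identifies each sequence.

score = LSTMDetection.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    metadata=metadata,
    entity_columns=['store_id'],
)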

FAQs

This metric is in Beta. Be careful when using the metric and interpreting its score.
  • The score heavily depends on the underlying algorithm used to model the data. If the dataset is not suited to a particular machine learning method, then the detection results may not be valid.
  • There are multiple interpretations for this metric's score. (See the Score section above.) Any interpretation is heavily dependent on how well we trust the algorithm to model the real data.

References