LSTMDetection calculates how difficult it is to tell apart the real, sequential data from the synthetic data. This is done using a neural network.
- Boolean: This metric converts boolean columns into 0/1 values. Missing values are replaced by the most commonly occurring value (mode).
- Categorical: This metric converts categorical columns into multiple one-hot encoded columns.
- Datetime: This metric converts datetime columns into numerical values using the Unix timestamp. Missing values are imputed using the mean.
- Numerical: This metric is designed to work with numerical columns. Missing values are imputed using the mean.
- Entity columns: It's important to include entity columns so the metric can tell which rows belong to which sequences.
This metric should not be used with any other column, such as primary keys or anonymized columns.
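The conversions listed above can be sketched with pandas. This is only an illustration of the stated rules, not the library's internal code: the helper `to_numeric_features`, the `column_kinds` mapping, and the toy table are all hypothetical names and data introduced here.

```python
import pandas as pd


def to_numeric_features(table, column_kinds):
    """Hypothetical helper: convert each listed column to numeric features.

    `column_kinds` maps a column name to 'boolean', 'categorical',
    'datetime' or 'numerical'. Columns that are not listed (for example
    entity columns) are passed through unchanged.
    """
    parts = [table.drop(columns=list(column_kinds))]
    for name, kind in column_kinds.items():
        col = table[name]
        if kind == 'boolean':
            # Replace missing values with the mode, then map True/False to 1/0.
            filled = col.where(col.notna(), col.mode().iloc[0])
            parts.append(filled.astype(bool).astype(int))
        elif kind == 'categorical':
            # One column per category value.
            parts.append(pd.get_dummies(col, prefix=name, dtype=int))
        elif kind == 'datetime':
            # Seconds since the Unix epoch; missing values become NaN first.
            stamps = pd.to_datetime(col)
            seconds = (stamps - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)
            parts.append(seconds.fillna(seconds.mean()))
        elif kind == 'numerical':
            parts.append(col.fillna(col.mean()))
        else:
            raise ValueError(f'Unknown column kind: {kind}')
    return pd.concat(parts, axis=1)


# Toy table invented for illustration; 'patient_id' plays the entity column.
real = pd.DataFrame({
    'patient_id': ['a', 'a', 'b', 'b'],
    'active': [True, None, True, False],
    'ward': ['icu', 'er', 'icu', 'er'],
    'visit': pd.to_datetime(['2021-01-01', None, '2021-01-03', '2021-01-02']),
    'heart_rate': [60.0, None, 80.0, 70.0],
})
features = to_numeric_features(real, {
    'active': 'boolean',
    'ward': 'categorical',
    'visit': 'datetime',
    'heart_rate': 'numerical',
})
```

In this sketch the entity column `patient_id` is left untouched so that rows can still be grouped into sequences afterwards.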
(highest) 1.0: The machine learning model cannot tell apart any of the real and synthetic rows
(lowest) 0.0: The machine learning model can correctly identify all the real and synthetic rows
This detection metric runs through the following steps:
1. Create a single, augmented table that has all the rows of real data and all the rows of synthetic data. Add an extra column to keep track of whether each original row is real or synthetic.
2. Split the augmented data to create training and validation sets.
3. Create a long short-term memory (LSTM) neural network. Train it on the training split. The neural network will predict whether each row is real or synthetic (i.e. predict the extra column we created in step #1).
4. Validate the model on the validation set.
5. Repeat steps #2-4 multiple times.
The final score is 1 minus the average ROC AUC score across all the cross-validation splits.
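The scoring formula can be illustrated in plain Python. This sketch covers only the last step (turning per-split predictions into a score), not the LSTM itself. The function names, the label convention (1 = synthetic, 0 = real), and the prediction values are assumptions made up for the example.

```python
from statistics import mean


def roc_auc(labels, scores):
    """ROC AUC computed as the fraction of (synthetic, real) pairs in which
    the synthetic row receives the higher score; ties count as half."""
    synthetic = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in synthetic for n in real)
    return wins / (len(synthetic) * len(real))


def detection_score(splits):
    """`splits` holds one (true_labels, predicted_scores) pair per validation split."""
    return 1 - mean(roc_auc(labels, scores) for labels, scores in splits)


# Two invented validation splits: one where the model separates the classes
# perfectly (AUC 1.0) and one where it cannot separate them at all (AUC 0.5).
splits = [
    ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),
    ([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]),
]
score = detection_score(splits)  # 1 - (1.0 + 0.5) / 2 = 0.25
```

A model that always separates real from synthetic rows drives the average AUC to 1 and the score to 0, matching the score range described above.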
Access this metric from the timeseries module and use the compute method.
from sdmetrics.timeseries import LSTMDetection
real_data: A pandas.DataFrame containing all the compatible columns of the real data
synthetic_data: A pandas.DataFrame containing all the compatible columns of the synthetic data
entity_columns: A list of strings that describe the names of entity columns. Entity columns are used to identify which rows belong to which sequences.
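A minimal usage sketch, assuming the parameters listed above are passed as keyword arguments to `compute`. The tables, the column names, and the wrapper `run_lstm_detection` are hypothetical, and the exact signature may differ between SDMetrics versions.

```python
import pandas as pd

# Toy sequential data invented for illustration: two sequences per table.
real_data = pd.DataFrame({
    'patient_id': ['a', 'a', 'a', 'b', 'b', 'b'],
    'heart_rate': [60.0, 62.0, 61.0, 80.0, 79.0, 83.0],
})
synthetic_data = pd.DataFrame({
    'patient_id': ['x', 'x', 'x', 'y', 'y', 'y'],
    'heart_rate': [58.0, 65.0, 60.0, 85.0, 77.0, 81.0],
})
entity_columns = ['patient_id']


def run_lstm_detection(real, synthetic, entities):
    """Hypothetical wrapper around the metric call."""
    # Imported lazily so the data setup above runs without SDMetrics installed.
    from sdmetrics.timeseries import LSTMDetection

    return LSTMDetection.compute(
        real_data=real,
        synthetic_data=synthetic,
        entity_columns=entities,
    )
```

Calling `run_lstm_detection(real_data, synthetic_data, entity_columns)` trains a neural network, so it may take some time, and the score may vary between runs.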