ML Efficacy: Sequential
The LSTMClassifierEfficacy metric calculates the success of using synthetic data to perform an ML prediction task.
Data Compatibility
Boolean: This metric converts boolean columns into 0/1 values. Missing values are replaced by the most commonly occurring value (mode).
Categorical: This metric converts categorical columns into multiple one-hot encoded columns.
Datetime: This metric converts datetime columns into numerical values using the Unix timestamp. It imputes missing values using the mean.
Numerical: This metric is designed to work with numerical columns. It imputes missing values using the mean.
Sequence Key: The sequence key must be included so that the metric can tell which rows belong to which sequence.
This metric should not be used with any other column, such as primary keys or anonymized columns.
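The conversions listed above can be sketched in a few lines of plain Python. This is only an illustration of the described behavior, not the metric's actual preprocessing code: the helper names are hypothetical, missing values are assumed to be represented as None, and datetimes are assumed to be timezone-aware.

```python
from datetime import datetime, timezone
from statistics import mean, mode


def encode_boolean(values):
    """Map booleans to 0/1, filling missing entries with the mode."""
    present = [v for v in values if v is not None]
    fill = mode(present)
    return [int(fill if v is None else v) for v in values]


def encode_numerical(values):
    """Fill missing numerical entries with the mean of the observed ones."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


def encode_datetime(values):
    """Convert datetimes to Unix timestamps, then mean-impute missing ones."""
    stamps = [None if v is None else v.timestamp() for v in values]
    return encode_numerical(stamps)


def encode_categorical(values):
    """One-hot encode: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {c: [int(v == c) for v in values] for c in categories}


# Hypothetical example columns.
flags = encode_boolean([True, None, True, False])
amounts = encode_numerical([1.0, None, 3.0])
colors = encode_categorical(["red", "blue", "red"])
```

In this sketch the missing boolean is filled with True (the mode) and the missing amount with 2.0 (the mean of the observed values).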
Score
(best) ∞: Using the synthetic data to train for an ML prediction task, you will be able to perform the task infinitely better than if you used the real data.
(great) score > 1.0: Using the synthetic data to train for an ML prediction task, you will be able to perform the task better than if you used the real data. The score determines the multiplier. For example, if the score=3, then it means you'll perform the task 3x better.
(good) score = 1.0: Using the synthetic data to train for an ML prediction task, you will be able to perform the task just as well as if you used the real data.
(worse) score < 1.0: Using the synthetic data to train for an ML prediction task, you will not be able to perform the task as well as if you used the real data. The score determines how much worse the synthetic data performs. For example, if the score=0.8, then it means you'll perform the task 80% as well.
(worst) 0.0: Using the synthetic data to train an ML model, you will not be able to successfully perform the ML prediction task at all -- it will always predict the wrong value.
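The score bands above can be expressed as a small lookup. The function name interpret_score is hypothetical and exists only for illustration; the labels are the ones used in the list above.

```python
import math


def interpret_score(score):
    """Return the label from the score table for a given efficacy score."""
    if math.isnan(score) or score < 0:
        raise ValueError("score must be a non-negative number")
    if score == 0.0:
        return "worst"
    if score < 1.0:
        return "worse"
    if score == 1.0:
        return "good"
    if math.isinf(score):
        return "best"
    return "great"


label = interpret_score(0.8)
```

A score of 0.8 falls in the "worse" band: the model trained on synthetic data performs the task about 80% as well as the one trained on real data.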
How does it work?
This metric performs multiple steps to calculate the ML efficacy:
Split the real data into a real test set (25% of rows) and a real training set (75% of rows).
Create a long short-term memory (LSTM) neural network [1]. Train it on the real training set and compute its prediction score for the real test set. This is the real score.
Create another long short-term memory (LSTM) neural network [1]. Train it on the synthetic data and compute its prediction score for the real test set. This is the synthetic score.
The final metric score is the ratio of the synthetic score to the real score (synthetic score / real score).
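The steps above can be sketched end to end as follows. To keep the example self-contained, a majority-class predictor stands in for the LSTM, accuracy is assumed as the prediction score, and the split is a simple row-level shuffle; a real implementation would presumably keep each sequence intact. All function names are hypothetical, and the handling of a real score of zero is an assumption rather than documented behavior.

```python
import random
from collections import Counter


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def majority_class_score(train_rows, test_rows, target):
    """Stand-in for 'train an LSTM and score it' (NOT an LSTM).

    Predicts the most common training label and returns the accuracy
    of that constant prediction on the test rows.
    """
    prediction = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    hits = sum(r[target] == prediction for r in test_rows)
    return hits / len(test_rows)


def efficacy_ratio(real_rows, synthetic_rows, target,
                   train_and_score=majority_class_score):
    """Ratio of the synthetic score to the real score on the real test set."""
    real_train, real_test = split_rows(real_rows)
    real_score = train_and_score(real_train, real_test, target)
    synthetic_score = train_and_score(synthetic_rows, real_test, target)
    if real_score == 0:
        # Assumption: treat a zero real score as an infinite (or undefined) ratio.
        return float("inf") if synthetic_score > 0 else float("nan")
    return synthetic_score / real_score


# Hypothetical toy data: every row has the same label.
real = [{"label": "a"} for _ in range(8)]
matching_synthetic = [{"label": "a"} for _ in range(8)]
mismatched_synthetic = [{"label": "b"} for _ in range(8)]

same = efficacy_ratio(real, matching_synthetic, "label")
none = efficacy_ratio(real, mismatched_synthetic, "label")
```

Both models are always evaluated on the same real test set, so the ratio compares only the effect of the training data.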
Usage
Access this metric from the timeseries module and use the compute method.
Parameters
(required) real_data: A pandas.DataFrame containing all the compatible columns of the real data
(required) synthetic_data: A pandas.DataFrame containing all the compatible columns of the synthetic data
metadata: A description of the dataset. See Sequential Metadata.
sequence_key: A list that describes the names of the columns in the sequence key. The sequence key is used to identify which rows belong to which sequences.
(required) target: A string representing the name of the column that you want to predict. The target column must be discrete (categorical or boolean).
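A call using these parameters might look like the sketch below. It assumes the metric is importable as sdmetrics.timeseries.LSTMClassifierEfficacy and that compute accepts the keyword names listed above; check both against your installed version. The wrapper function compute_lstm_efficacy is hypothetical.

```python
def compute_lstm_efficacy(real_data, synthetic_data, metadata, target):
    """Hypothetical wrapper around the metric's compute method."""
    # Imported inside the function so this helper can be defined even
    # when the library is not installed (assumed import path).
    from sdmetrics.timeseries import LSTMClassifierEfficacy

    return LSTMClassifierEfficacy.compute(
        real_data=real_data,            # pandas.DataFrame of real rows
        synthetic_data=synthetic_data,  # pandas.DataFrame of synthetic rows
        metadata=metadata,              # sequential metadata description
        target=target,                  # name of the discrete column to predict
    )
```

If the metadata does not identify the sequence key, the sequence_key parameter described above would be passed in the same way.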
FAQs
This metric is in Beta. Be careful when using the metric and interpreting its score.
The score depends on the underlying algorithm used to model the data. If the dataset is not suited for a particular machine learning method, then the predicted values may not be valid.
The score values may be hard to interpret because the score is a ratio. Scores in the range [0.0, 1.0) indicate that using the synthetic data is worse than using the real data, while scores in the range [1.0, ∞) indicate that it is as good or better. Additionally, the score does not allow you to easily interpret the baseline accuracy of the overall task (real score only).
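A short worked example shows why the ratio hides the baseline. The numbers are invented for illustration, and score_ratio is a hypothetical helper that divides the synthetic score by the real score.

```python
def score_ratio(real_score, synthetic_score):
    """Synthetic score divided by real score."""
    return synthetic_score / real_score


# Invented scores for two hypothetical tasks.
hard_task = score_ratio(real_score=0.5, synthetic_score=0.25)
easy_task = score_ratio(real_score=0.9, synthetic_score=0.45)
```

Both tasks give a ratio of about 0.5, even though the real-data model scores 0.5 on one task and 0.9 on the other. Reporting the real score next to the ratio makes that difference visible.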
References
[1] https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html