ML Efficacy: Sequential

LSTMClassifierEfficacy calculates the success of using synthetic data to perform an ML prediction task.

Data Compatibility

  • Boolean: This metric converts boolean columns into 0/1 values. Missing values are replaced by the most commonly occurring value (mode).

  • Categorical: This metric converts categorical columns into multiple, one hot encoded columns.

  • Datetime: This metric converts datetime columns into numerical values using the Unix timestamp. Missing values are imputed using the mean.

  • Numerical: This metric is designed to work with numerical columns. Missing values are imputed using the mean.

  • Sequence Key: It's important to include the sequence key to tell which rows should belong to which sequences.

This metric should not be used with any other column, such as primary keys or anonymized columns.
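The conversions listed above can be illustrated with a short pandas sketch. This is not the library's internal code: the function encode_columns, the column_types mapping, and the choice of seconds as the Unix timestamp unit are all assumptions made for illustration.

```python
import pandas as pd


def encode_columns(df, column_types):
    """Hypothetical helper mirroring the conversions described above.

    column_types maps a column name to one of:
    'boolean', 'categorical', 'datetime', 'numerical'.
    """
    parts = []
    for name, kind in column_types.items():
        col = df[name]
        if kind == "boolean":
            # Replace missing values with the mode, then map to 0/1.
            mode = col.dropna().mode().iloc[0]
            filled = col.where(col.notna(), mode)
            parts.append(filled.astype(bool).astype(int).rename(name))
        elif kind == "categorical":
            # One column per category value.
            parts.append(pd.get_dummies(col, prefix=name, dtype=int))
        elif kind == "datetime":
            # Seconds since the Unix epoch (unit is an assumption),
            # with missing values replaced by the mean.
            ts = pd.to_datetime(col)
            seconds = (ts - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
            parts.append(seconds.fillna(seconds.mean()).rename(name))
        elif kind == "numerical":
            # Missing values replaced by the mean.
            numeric = col.astype(float)
            parts.append(numeric.fillna(numeric.mean()).rename(name))
        else:
            raise ValueError(f"Unsupported column type: {kind}")
    return pd.concat(parts, axis=1)
```

The sequence key is deliberately left out of column_types here, since it identifies sequences rather than serving as a feature.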

Score

(best) ∞: Using the synthetic data to train for an ML prediction task, you will be able to perform the task infinitely better than if you used the real data.

(great) score > 1.0: Using the synthetic data to train for an ML prediction task, you will be able to perform the task better than if you used the real data. The score determines the multiplier. For example, if the score=3, you'll perform the task 3x better.

(good) score = 1.0: Using the synthetic data to train for an ML prediction task, you will be able to perform the task just as well as if you used the real data.

(worse) score < 1.0: Using the synthetic data to train for an ML prediction task, you will not be able to perform the task as well as if you used the real data. The score determines how much worse the synthetic data performs. For example, if the score=0.8, then it means you'll perform the task 80% as well.

(worst) 0.0: Using the synthetic data to train an ML model, you will not be able to successfully perform the ML prediction task at all -- it will always predict the wrong value.

How does it work?

This metric performs multiple steps to calculate the ML efficacy:

  1. Split the real data into a real test set (25% of rows) and a real training set (75% of rows).

  2. Create a long short-term memory (LSTM) neural network [1]. Train it on the real training set and compute its prediction score for the real test set. This is the real score.

  3. Create another long short-term memory (LSTM) neural network [1]. Train it on the synthetic data and compute its prediction score for the real test set. This is the synthetic score.

  4. The final metric score is the ratio of the two previous scores.

$$\text{final score} = \frac{\text{synthetic score}}{\text{real score}}$$
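As a worked illustration of the ratio in step 4, the sketch below divides a synthetic score by a real score. The function name efficacy_ratio is hypothetical, and the handling of a real score of 0 (returning ∞, or NaN when both scores are 0) is an assumption chosen to match the score ranges described earlier, not confirmed library behavior.

```python
import math


def efficacy_ratio(synthetic_score: float, real_score: float) -> float:
    """Ratio of the synthetic score to the real score (illustrative only)."""
    if real_score == 0:
        # Assumption: a positive synthetic score against a zero real score
        # is treated as infinitely better; 0/0 is left undefined.
        return math.inf if synthetic_score > 0 else math.nan
    return synthetic_score / real_score


# A synthetic score of 0.72 against a real score of 0.90 gives roughly 0.8,
# i.e. the task is performed about 80% as well.
ratio = efficacy_ratio(0.72, 0.90)
print(round(ratio, 2))
```

Because the result is a ratio, it is worth looking at the real score on its own as well: a ratio near 1.0 says little if the real score itself is low.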

Usage

Access this metric from the timeseries module and use the compute method.

from sdmetrics.timeseries import LSTMClassifierEfficacy

LSTMClassifierEfficacy.compute(
    real_data=real_table,
    synthetic_data=synthetic_data,
    metadata=my_sequential_metadata_dict,
    target='Heart Rate'
)

Parameters

  • (required) real_data: A pandas.DataFrame containing all the compatible columns of the real data

  • (required) synthetic_data: A pandas.DataFrame containing all the compatible columns of the synthetic data

  • metadata: A description of the dataset. See Sequential Metadata.

  • sequence_key: A list that describes the names of the columns in the sequence key. The sequence key is used to identify which rows belong to which sequences.

  • (required) target: A string representing the name of the column that you want to predict.

The target column must be discrete (categorical or boolean).

FAQs

This metric is in Beta. Be careful when using the metric and interpreting its score.

  • The score depends on the underlying algorithm used to model the data. If the dataset is not suited for a particular machine learning method, then the predicted values may not be valid.

  • The score may be hard to interpret because it is a ratio. Scores in the range [0.0, 1.0) indicate that using the synthetic data is worse than using the real data, while scores in the range [1.0, ∞) indicate that it's just as good or better. Additionally, the score does not allow you to easily interpret the baseline accuracy of the overall task (real score only).

Which data should I use for training and testing? Can I change this?

This metric assumes that you will replace the real data entirely with the synthetic data* for ML development and training. If this is not the case, you can alter the datasets that you input based on your use case.

For example, your goal may be to enhance or augment the real dataset for ML development. In this case, you can input the combined set of real and synthetic data instead of just the synthetic data.

*Keep in mind that the synthetic data is itself created using real data. For the most unbiased measurement, test the ML efficacy using real data that was not involved in the synthetic data creation.
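The two setups described above can be prepared with pandas before calling the metric. The sketch below is one possible approach, not part of the library: split_by_sequence and combine_for_augmentation are hypothetical helpers, and prefixing the sequence key values assumes the key is only used to group rows into sequences.

```python
import pandas as pd


def split_by_sequence(real, sequence_key, holdout_fraction=0.25, seed=0):
    """Hold out whole sequences so no sequence is shared between the parts."""
    keys = pd.Series(real[sequence_key].unique())
    holdout_keys = keys.sample(frac=holdout_fraction, random_state=seed)
    is_holdout = real[sequence_key].isin(holdout_keys)
    return real[~is_holdout], real[is_holdout]


def combine_for_augmentation(real_train, synthetic, sequence_key):
    """Stack real and synthetic rows into one table.

    Key values are prefixed so that a real sequence and a synthetic sequence
    with the same key are not merged into a single sequence (assumption: the
    key is only used for grouping, so changing its values is harmless).
    """
    real_part = real_train.copy()
    synthetic_part = synthetic.copy()
    real_part[sequence_key] = "real-" + real_part[sequence_key].astype(str)
    synthetic_part[sequence_key] = "synthetic-" + synthetic_part[sequence_key].astype(str)
    return pd.concat([real_part, synthetic_part], ignore_index=True)
```

For an unbiased measurement, the held-out part would be kept away from the synthesizer and passed as real_data. For augmentation, the combined table would be passed as synthetic_data.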

References

[1] https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
