BinaryClassifierRecallEfficacy

This metric measures how valuable synthetic data augmentation can be for solving a binary prediction problem.

This metric combines the real and synthetic data to create an augmented dataset. Then, it calculates whether the augmented data improves the performance of an ML classifier, compared to using the real data alone.

Use this metric when you're interested in improving the recall score of the ML classifier.

Data Compatibility

  • Numerical: This metric is meant for numerical data

  • Datetime: This metric works on datetime data by considering the timestamps as continuous values

  • Categorical: This metric works on categorical data by encoding it as numerical data

  • Boolean: This metric works on boolean data by encoding it as numerical data

This metric ignores missing values.

Score

(best) 1.0: Augmenting the real data with synthetic data improves the ML classifier's recall by the most it possibly can (100%)

(baseline) 0.5: Augmenting the real data with synthetic data does not change the ML classifier's recall at all

(worst) 0.0: Augmenting the real data with synthetic data decreases the ML classifier's recall by the most it possibly can (by 100%).

Is synthetic data improving my ML classifier's recall? Any score >0.5 indicates that the synthetic data is having a positive impact on the ML classifier. This should be considered a success.

How does it work?

This metric combines the real and synthetic data together to form augmented data. It trains an ML classifier on the augmented data to solve a binary classification problem (with the precision fixed at a given value). It then evaluates the recall of the ML classifier using a validation set [1]. We call this score the augmented_score.

Next, it repeats the same process but without any synthetic data. (Meaning that it only uses the real data to train the ML classifier.) This forms the baseline_score.

The final score is the difference between the augmented and baseline scores, adjusted to provide a value in the [0, 1] range.

score = (augmented_score - baseline_score) / 2 + 0.5
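
To make the procedure concrete, here is a simplified sketch of the overall flow. It is illustrative only: it assumes the feature columns are already numeric, uses placeholder DataFrames (real_table, synthetic_table, real_holdout_set, matching the usage example below), and skips the precision-fixing step that the actual metric performs internally.

# Illustrative sketch only -- not the SDMetrics implementation
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import recall_score

def recall_on_validation(train_df, validation_df, target, positive_label):
    # train a classifier on train_df and return its recall on validation_df
    X_train = train_df.drop(columns=[target])
    y_train = (train_df[target] == positive_label).astype(int)
    X_val = validation_df.drop(columns=[target])
    y_val = (validation_df[target] == positive_label).astype(int)

    model = XGBClassifier()
    model.fit(X_train, y_train)
    return recall_score(y_val, model.predict(X_val))

# augmented data = real training data + synthetic data
augmented_table = pd.concat([real_table, synthetic_table], ignore_index=True)

augmented_score = recall_on_validation(augmented_table, real_holdout_set, 'covid_status', 1)
baseline_score = recall_on_validation(real_table, real_holdout_set, 'covid_status', 1)

# rescale the difference into the [0, 1] range, e.g. 0.8 vs. 0.6 gives a score of 0.6
score = (augmented_score - baseline_score) / 2 + 0.5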

Usage

Access this metric from the single_table.data_augmentation module and use the compute_breakdown method.

from sdmetrics.single_table.data_augmentation import BinaryClassifierRecallEfficacy

score = BinaryClassifierRecallEfficacy.compute_breakdown(
    real_training_data=real_table,
    synthetic_data=synthetic_table,
    real_validation_data=real_holdout_set,
    metadata=single_table_metadata_dict,
    prediction_column_name='covid_status',
    minority_class_label=1,
    classifier='XGBoost',
    fixed_precision_value=0.9
)
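
The metadata argument expects a single table metadata dictionary describing each column's type. A minimal, hypothetical sketch for the example above might look like the following (column names other than covid_status are made up):

# hypothetical single table metadata dictionary for the example above
single_table_metadata_dict = {
    'columns': {
        'age': {'sdtype': 'numerical'},
        'vaccinated': {'sdtype': 'boolean'},
        'checkup_date': {'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'},
        'covid_status': {'sdtype': 'categorical'}
    }
}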

Parameters

  • (required) real_training_data: A pandas.DataFrame object containing the real data that you used for training your synthesizer. This metric will use this data for training a Binary Classification model.

  • (required) synthetic_data: A pandas.DataFrame object containing the synthetic data you sampled from your synthesizer. This metric will use this data for training a Binary Classification model.

  • (required) real_validation_data: A pandas.DataFrame object containing a holdout set of real data. This data should not have been used to train your synthesizer. This metric will use this data for evaluating a Binary Classification model.

  • (required) metadata: A metadata dictionary that describes the table of data

  • (required) prediction_column_name: A string with the name of the column you are interested in predicting. This should be either a categorical or boolean column.

  • (required) minority_class_label: The value that you are considering to be a positive result, from the perspective of Binary Classification. All other values in this column will be considered negative results.

  • classifier: A string describing the ML algorithm to use when building the Binary Classification model. Supported options are:

    • (default) 'XGBoost': Use gradient boosting from the XGBoost library [2]

    • Support for additional classifiers is coming in future releases

  • fixed_precision_value: A float describing the value at which to fix the precision when building the Binary Classification model

    • (default) 0.9: Fix the precision at 90%

    • float: Fix the precision at the given value. This must be in the range (0, 1.0).

If you'd like to fix the recall value instead, use the BinaryClassifierPrecisionEfficacy metric.

The compute_breakdown method returns a dictionary containing the overall score, as well as the individual recall scores for the augmented data and real data baseline.

{
  'score': 0.7891,
  'augmented_data': {
    'recall_score_training': 0.950,
    'recall_score_validation': 0.912,
    'precision_score_validation': 0.84,
    'prediction_counts_validation': {
      'true_positive': 21,
      'false_positive': 4,
      'true_negative': 73,
      'false_negative': 3
    },
  },
  'real_data_baseline': {
    # keys are the same as the 'augmented_data' dictionary
   },
  'parameters': {
    'prediction_column_name': 'covid_status',
    'minority_class_label': 1,
    'classifier': 'XGBoost',
    'fixed_precision_value': 0.9
  }
}
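
Since the return value is a plain dictionary, you can pull out individual pieces of the breakdown, for example:

# access individual values from the breakdown returned above
overall_score = score['score']
augmented_recall = score['augmented_data']['recall_score_validation']
baseline_recall = score['real_data_baseline']['recall_score_validation']

print(f'Recall with augmentation: {augmented_recall}, baseline: {baseline_recall}')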

FAQs

What is the purpose of fixing a precision value?

For many important binary prediction problems, there is typically a minority class that occurs very rarely but is critical to predict. For example, fraudulent credit card transactions, or healthcare patients that test positive for a disease.

For such problems, the precision describes whether the labels predicted to be positive actually are positive. This is important to fix because very few of these labels are positive. The recall then describes how many of the overall positive labels we are able to predict.

Precision and recall represent a tradeoff between the accuracy of predictions and the completeness of predictions. Fixing the precision value allows you to compare the recall scores more easily.
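
As a concrete illustration, using the prediction_counts_validation values from the example output above:

true_positive, false_positive, false_negative = 21, 4, 3

# precision: of the labels predicted positive, how many really are positive?
precision = true_positive / (true_positive + false_positive)   # 21 / 25 = 0.84

# recall: of the labels that really are positive, how many did we find?
recall = true_positive / (true_positive + false_negative)      # 21 / 24 = 0.875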

Why is the actual precision value different from the one I supplied?

This metric will try to fix the precision at the exact value that you supply. However, based on the number of data points you have, it may not be possible to achieve that exact value.

This metric chooses a precision value that is as close as possible to the requested value without going under. For example, if you fix the precision at 90%, you may see an actual precision of 92%.
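
As a hedged sketch of the underlying idea (not necessarily how SDMetrics implements it): the classifier outputs probabilities, so only a discrete set of precision values is achievable, and a decision threshold is chosen whose precision meets or exceeds the requested value. The labels and probabilities below are hypothetical.

import numpy as np
from sklearn.metrics import precision_recall_curve

# hypothetical validation labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# keep only thresholds whose precision does not go under the requested value,
# then take the lowest such threshold (which favors recall)
requested_precision = 0.9
valid = np.where(precisions[:-1] >= requested_precision)[0]
chosen_threshold = thresholds[valid[0]]
actual_precision = precisions[valid[0]]   # may be higher than requested, e.g. 1.0 here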

What is the purpose of a real data baseline?

In this scenario, you are interested in measuring the value of the synthetic data. The assumption is that augmenting the real data with synthetic data will allow you to build a better ML classifier than using the real data alone. The real data alone is the baseline.

Without the real data baseline, it's hard to interpret the ML classifier's outputs. The ML classifier may be better at classifying some datasets than others, which means that using synthetic data can have varied effects.

If I don't have a validation set, can I just use part of the real data?

The purpose of a validation set is to measure the quality of the ML classifier's predictions. It's very important that the validation data has never been used to create a synthesizer or synthetic data. Otherwise, the synthetic data can leak patterns of the validation set into your ML classifier.

So if all of your real data was used to create the synthetic data, it is not possible to use any part of it as your validation set.

References

[1] https://en.wikipedia.org/wiki/Precision_and_recall

[2] https://xgboost.readthedocs.io/en/stable/