BinaryClassifierPrecisionEfficacy
This metric measures how valuable synthetic data augmentation can be for solving a binary prediction problem.
This metric combines the real and synthetic data to create an augmented dataset. It then measures whether the augmented data improves the performance of an ML classifier, compared to using the real data alone.
Use this metric when you're interested in improving the precision score of the ML classifier.
Data Compatibility
Numerical: This metric is meant for numerical data
Datetime: This metric works on datetime data by considering the timestamps as continuous values
Categorical: This metric works on categorical data by encoding it as numerical data
Boolean: This metric works on boolean data by encoding it as numerical data
This metric ignores missing values.
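For illustration, the conversions described above might look like the sketch below. This is an assumption about how the inputs could be turned into numerical features, not the library's actual preprocessing; the function name and the exact encoding choices are hypothetical.

```python
# Illustrative preprocessing sketch (assumed, not the library's exact code):
# datetimes become continuous values, categoricals and booleans become numeric
# codes, and missing values pass through as NaN so they can be ignored.
import pandas as pd

def to_numeric_features(table: pd.DataFrame) -> pd.DataFrame:
    converted = table.copy()
    for column in converted.columns:
        if pd.api.types.is_datetime64_any_dtype(converted[column]):
            # Seconds since the Unix epoch; NaT becomes NaN
            converted[column] = (converted[column] - pd.Timestamp(0)) / pd.Timedelta(seconds=1)
        elif pd.api.types.is_bool_dtype(converted[column]):
            converted[column] = converted[column].astype(float)
        elif not pd.api.types.is_numeric_dtype(converted[column]):
            # Ordinal codes for categories; the -1 code for missing values becomes NaN
            converted[column] = converted[column].astype('category').cat.codes.replace(-1, float('nan'))
    return converted
```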
Score
(best) 1.0: Augmenting the real data with synthetic data improves the ML classifier's precision by the most it possibly can (100%)
(baseline) 0.5: Augmenting the real data with synthetic data does not change the ML classifier's precision at all
(worst) 0.0: Augmenting the real data with synthetic data decreases the ML classifier's precision by the most it possibly can (by 100%).
Is synthetic data improving my ML classifier's precision? Any score >0.5 indicates that the synthetic data is having a positive impact on the ML classifier's precision. This should be considered a success.
How does it work?
This metric combines the real and synthetic data together to form augmented data. It trains an ML classifier on the augmented data to solve a binary classification problem (at a fixed recall value). It then evaluates the precision of the ML classifier using a validation set [1]. We call this score the augmented_score.
Next, it repeats the same process but without any synthetic data. (Meaning that it only uses the real data to train the ML classifier.) This forms the baseline_score.
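A minimal sketch of this evaluation step is shown below. It assumes the recall is fixed by choosing a probability threshold on the training data; the function name and the exact threshold-selection logic are illustrative, not the library's implementation.

```python
# Sketch: train a classifier, fix the recall via the decision threshold,
# then measure precision on the holdout (validation) set.
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score
from xgboost import XGBClassifier

def precision_at_fixed_recall(X_train, y_train, X_val, y_val, fixed_recall=0.9):
    model = XGBClassifier().fit(X_train, y_train)

    # Choose the largest threshold whose recall on the training data is
    # still >= the fixed recall (this maximizes precision at that recall).
    train_probs = model.predict_proba(X_train)[:, 1]
    precisions, recalls, thresholds = precision_recall_curve(y_train, train_probs)
    eligible = np.flatnonzero(recalls[:-1] >= fixed_recall)
    threshold = thresholds[eligible[-1]] if eligible.size else thresholds[0]

    # Evaluate precision on the validation set at that threshold.
    val_probs = model.predict_proba(X_val)[:, 1]
    predictions = (val_probs >= threshold).astype(int)
    return precision_score(y_val, predictions, zero_division=0)
```

In this sketch, the augmented_score would come from calling such a function with the augmented training data, and the baseline_score from calling it with the real training data only.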
The final score is the difference between the augmented and baseline scores, adjusted to provide a value in the [0, 1] range.
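Based on the score definitions above (1.0 for the largest possible improvement, 0.5 for no change, 0.0 for the largest possible decrease), the adjustment is presumably a linear rescaling of the precision difference. The formula below is an assumption consistent with those endpoints, not a quote of the library's code.

```python
# Assumed rescaling: a precision difference in [-1, 1] is mapped into [0, 1].
def normalize_score(augmented_score, baseline_score):
    return (augmented_score - baseline_score + 1) / 2

# Example: baseline precision 0.80, augmented precision 0.86 -> final score 0.53
print(round(normalize_score(0.86, 0.80), 2))  # 0.53
```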
Usage
Access this metric from the single_table.data_augmentation module and use the compute_breakdown method.
Parameters
(required) real_training_data: A pandas.DataFrame object containing the real data that you used for training your synthesizer. This metric will use this data for training a Binary Classification model.
(required) synthetic_data: A pandas.DataFrame object containing the synthetic data you sampled from your synthesizer. This metric will use this data for training a Binary Classification model.
(required) real_validation_data: A pandas.DataFrame object containing a holdout set of real data. This data should not have been used to train your synthesizer. This metric will use this data for evaluating a Binary Classification model.
(required) metadata: A metadata dictionary that describes the table of data.
(required) prediction_column_name: A string with the name of the column you are interested in predicting. This should be either a categorical or boolean column.
(required) minority_class_label: The value that you are considering to be a positive result, from the perspective of Binary Classification. All other values in this column will be considered negative results.
classifier: A string describing the ML algorithm to use when building the Binary Classification model. Supported options are:
(default) 'XGBoost': Use gradient boosting from the XGBoost library [2]
Support for additional classifiers is coming in future releases.
fixed_recall_value: A float describing the value to fix for the recall when building the Binary Classification model.
(default) 0.9: Fix the recall at 90%
float: Fix the recall at the given value. This must be in the range (0, 1.0).
The compute_breakdown method returns a dictionary containing the overall score, as well as the individual precision scores for the augmented data and the real data baseline.
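A usage sketch is shown below. The sdmetrics package root, the placeholder DataFrames, the metadata dictionary, and the 'churn' target column are assumptions for illustration; the module, method, and parameter names are the ones documented on this page.

```python
# Usage sketch: compute the metric breakdown from real, synthetic, and holdout data.
# The DataFrames, metadata, and column/label values below are placeholders.
from sdmetrics.single_table.data_augmentation import BinaryClassifierPrecisionEfficacy

score_breakdown = BinaryClassifierPrecisionEfficacy.compute_breakdown(
    real_training_data=real_training_table,    # real data used to train the synthesizer
    synthetic_data=synthetic_table,            # data sampled from the synthesizer
    real_validation_data=real_holdout_table,   # holdout real data, never seen by the synthesizer
    metadata=metadata_dict,                    # metadata dictionary describing the table
    prediction_column_name='churn',            # hypothetical categorical/boolean target column
    minority_class_label=True,                 # value treated as the positive class
    classifier='XGBoost',
    fixed_recall_value=0.9,
)
```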
FAQs
References