ML Augmentation

ML Augmentation metrics capture the value of using synthetic data to train an ML model. They assume that you are augmenting the real data with synthetic data to create an enhanced training set for solving an ML problem.

The hope is that a model trained on the augmented data (real + synthetic) will perform better than one trained on the real data alone. Comparing the two measures the return on investment (ROI) of adding synthetic data to your ML workflow for a downstream project.
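The comparison described above can be sketched in a few lines. This is only an illustration of the idea, not how SDMetrics computes its metrics: the nearest-centroid classifier, the accuracy score, the toy data, and every function name (`fit_centroids`, `predict`, `accuracy`, `augmentation_lift`) are assumptions made for this example.

```python
from statistics import mean


def fit_centroids(rows):
    """Return the mean feature value per label; rows are (feature, label) pairs."""
    labels = {label for _, label in rows}
    return {label: mean(x for x, y in rows if y == label) for label in labels}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(predict(centroids, x) == y for x, y in rows)
    return correct / len(rows)


def augmentation_lift(real_train, synthetic, real_validation):
    """Score a model trained on real data alone against one trained on
    real + synthetic data, both evaluated on the same held-out real rows."""
    baseline = accuracy(fit_centroids(real_train), real_validation)
    augmented = accuracy(fit_centroids(real_train + synthetic), real_validation)
    return {"baseline": baseline, "augmented": augmented, "lift": augmented - baseline}


# Hypothetical toy data: (feature, label) pairs with a rare positive class.
real_train = [(0.0, 0), (1.0, 0), (2.0, 0), (9.0, 1)]
synthetic = [(5.0, 1), (6.0, 1), (7.0, 1)]
real_validation = [(0.5, 0), (1.5, 0), (4.5, 1), (6.0, 1), (8.0, 1)]

result = augmentation_lift(real_train, synthetic, real_validation)
print(result)
```

A positive `lift` means the synthetic rows helped on this validation set; a zero or negative value means they did not. Both models are scored on real validation data only, so the comparison reflects performance on the data the model will actually face.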

Synthetic data can be evaluated in two ways. Much of the focus has been on measuring statistical differences between the real and synthetic data, such as quality measures. But this is not enough. Synthetic data also needs to provide a return on investment (ROI) for the task it is ultimately meant to accomplish, whether that is software testing, machine learning development, or something else. When possible, include metrics that measure ROI in your evaluation.

SDMetrics includes metrics for statistical data differences as well as for the ultimate ROI for different tasks. The two may or may not correlate.

Browse

Apply these metrics to evaluate the ROI of synthetic data for ML augmentation:
