Regression

Regression metrics calculate the success of using synthetic data to perform an ML regression task. Each metric uses a different ML algorithm for the computation:

  • LinearRegression

  • MLPRegressor

Data Compatibility

  • Boolean: These metrics convert boolean columns into 0/1 values. Missing values are replaced by the most commonly occurring value (mode).

  • Categorical: These metrics convert categorical columns into multiple one-hot encoded columns.

  • Datetime: These metrics convert datetime columns into numerical values using the Unix timestamp. They impute missing values using the mean.

  • Numerical: These metrics are designed to work with numerical columns. They impute missing values using the mean.

These metrics should not be used with any other type of column, such as primary keys or anonymized columns.
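The conversions listed above can be sketched with pandas. This is only an illustration of the rules as described, not the library's internal code. The preprocess function and its column_types mapping are hypothetical names, and the datetime conversion assumes the Unix timestamp is expressed in seconds.

```python
import pandas as pd


def preprocess(df, column_types):
    """Convert every listed column to numbers (hypothetical helper).

    column_types maps a column name to 'boolean', 'categorical',
    'datetime' or 'numerical'. This mapping is an assumption made
    for the sketch, not part of the library.
    """
    parts = []
    for name, kind in column_types.items():
        col = df[name]
        if kind == "boolean":
            # Fill missing values with the mode, then map False/True to 0/1.
            mode_value = col.dropna().mode().iloc[0]
            parts.append(col.map(lambda v: int(mode_value if pd.isna(v) else v)))
        elif kind == "categorical":
            # One new 0/1 column per category.
            parts.append(pd.get_dummies(col, prefix=name, dtype=int))
        elif kind == "datetime":
            # Seconds since the Unix epoch (assumed unit); missing values get the mean.
            seconds = (col - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
            parts.append(seconds.fillna(seconds.mean()))
        elif kind == "numerical":
            # Missing values get the column mean.
            parts.append(col.fillna(col.mean()))
        else:
            raise ValueError(f"Unsupported column type: {kind}")
    return pd.concat(parts, axis=1)


# Made-up example data, one column of each supported type.
df = pd.DataFrame({
    "active": [True, False, None, True],
    "color": ["red", "blue", "red", "blue"],
    "joined": pd.to_datetime(["1970-01-01", "1970-01-03", None, "1970-01-02"]),
    "amount": [1.0, None, 3.0, 2.0],
})
out = preprocess(df, {
    "active": "boolean",
    "color": "categorical",
    "joined": "datetime",
    "amount": "numerical",
})
```

The result contains only numerical columns, with the single categorical column expanded into color_blue and color_red.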

Score

(best) 1.0: Given the training data with the provided ML algorithm, the model's predictions match the actual values in the test data exactly

(worst) -∞: Given the training data with the provided ML algorithm, the model's predictions can be arbitrarily far from the actual values in the test data
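To make the score range concrete, here is a small sketch of the standard coefficient of determination, R² = 1 - SS_res / SS_tot. The function name r2 and the sample numbers are made up for illustration. A score of 0.0 corresponds to always predicting the mean of the test values, and anything worse than that goes negative.

```python
def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]

perfect = r2(actual, actual)                 # predictions match exactly
baseline = r2(actual, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
poor = r2(actual, [4.0, 3.0, 2.0, 1.0])      # predictions in reverse order

print(perfect, baseline, poor)  # 1.0 0.0 -3.0
```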

How does it work?

All ML efficacy metrics perform the same steps:

  1. Train the ML algorithm using the training data (usually synthetic data). The output is an ML model that can predict the value of a given target column.

  2. Test the ML model by making predictions on the testing data (usually real data) and comparing against the actual values.

  3. Return the R² (coefficient of determination) [1] score computed on the testing data.

Usage

Access this metric from the single_table module and use the compute method.

from sdmetrics.single_table import LinearRegression, MLPRegressor

LinearRegression.compute(
    test_data=real_data,
    train_data=synthetic_data,
    target='numerical_column_name',
    metadata=metadata
)

Parameters

  • (required) test_data: A pandas.DataFrame containing the full data to test on. This should include the column that you are trying to predict.

  • (required) train_data: A pandas.DataFrame containing the full data to train on. This should include the column that you are trying to predict.

  • (required) target: A string representing the name of the column that you want to predict. This must be a numerical column.

  • metadata: A description of the dataset. See Single Table Metadata.

FAQs

This metric is in Beta. Be careful when using the metric and interpreting its score.

  • The score depends on the underlying algorithm used to model the data. If the dataset is not suited for a particular machine learning method, then the predicted values may not be valid.

  • Because the score is lower-bounded by -∞, the metric may be hard to interpret.

  • In a real-world scenario, you may spend more effort building an ML model. These metrics only allow you to select from specific algorithms (Linear, MLP).

Which data should I use for training versus testing?

This depends on how you plan to use the synthetic data.

  • A common goal is to replace the real data with synthetic data for ML development. In this case, you can train using the synthetic data and test using the real data.* This is also known as the TSTR (Train Synthetic Test Real) score.

  • Another goal might be to enhance or augment the real dataset for ML development. In this case, you can train on a combined set of real and synthetic data, and then test with other real data.*

  • To baseline the difficulty of the ML task, split the real data into a train and test set. This is also known as the TRTR (Train Real Test Real) score.

*Keep in mind that the synthetic data is itself created using real data. For the most unbiased measurement, test the ML efficacy using real data that was not involved in the synthetic data creation.
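One possible way to arrange the data for the three scenarios above is sketched below with pandas. The variable names, the 80/20 split and the toy values are all assumptions. In particular, synthetic_data is only a placeholder for the output of a synthesizer that was fitted on real_train alone.

```python
import pandas as pd

# Made-up real dataset with one feature and one target column.
real_data = pd.DataFrame({
    "x": list(range(10)),
    "y": [2.0 * v for v in range(10)],
})

# Hold out real rows that the synthesizer never sees.
real_train = real_data.sample(frac=0.8, random_state=0)
real_holdout = real_data.drop(real_train.index)

# Placeholder: in practice this comes from a synthesizer fitted on real_train only.
synthetic_data = pd.DataFrame({"x": [0.5, 1.5, 2.5], "y": [1.0, 3.0, 5.0]})

# (train_data, test_data) pairs for each scenario.
splits = {
    "TSTR": (synthetic_data, real_holdout),
    "augmented": (
        pd.concat([real_train, synthetic_data], ignore_index=True),
        real_holdout,
    ),
    "TRTR": (real_train, real_holdout),
}
```

Each pair can then be passed as train_data and test_data to the compute method, so that all three scores are measured against the same held-out real rows.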

What if the column I want to predict is not continuous?

There are other ML Efficacy metrics available for different types of target columns for your ML prediction task.

References

[1] https://en.wikipedia.org/wiki/Coefficient_of_determination
