* OutlierCoverage

*SDV Enterprise Feature. This feature is available to our licensed users and is not currently in our public library. To learn more about SDV Enterprise and its extra features, visit our website.

This metric measures whether the synthetic data contains outliers that were present in the real data. It evaluates a failure mode where the synthetic data does not contain any outliers.

Data Compatibility

  • Numerical: This metric is meant for numerical data

  • Datetime: This metric converts datetime values into numerical values

This metric ignores missing values.
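The preprocessing described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper name `to_numeric` is hypothetical, and the choice of POSIX timestamps (seconds since the Unix epoch) as the numerical encoding for datetimes is an assumption.

```python
import math
from datetime import datetime, timezone


def to_numeric(values):
    """Drop missing entries and map datetimes to floats (hypothetical helper)."""
    cleaned = []
    for v in values:
        # Treat None and float NaN as missing values and skip them.
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue
        if isinstance(v, datetime):
            # Assumption: encode datetimes as seconds since the Unix epoch.
            v = v.timestamp()
        cleaned.append(float(v))
    return cleaned


dates = [
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    None,  # missing value, ignored
    datetime(2024, 1, 2, tzinfo=timezone.utc),
]
numeric_dates = to_numeric(dates)
numeric_mixed = to_numeric([1, float("nan"), 2.5])
```

After this step, both columns are plain lists of floats, so the same outlier logic can be applied to numerical and datetime data alike.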

This metric is designed for data that contains outliers. We assume that the real data contains outliers. If it does not, the proportion of real outliers is zero and the metric is undefined.

Score

(best) 1.0: The synthetic data fully covers the outlier regions that are in the real data

(worst) 0.0: The synthetic data does not contain any outliers

How does it work?

This metric first finds outliers in the real data (R) using the interquartile range (IQR) [1], where IQR = Q3 − Q1. Any value more than 1.5 × IQR below Q1 is considered a left outlier, and any value more than 1.5 × IQR above Q3 is considered a right outlier.
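As a worked illustration of the outlier thresholds, here is a minimal sketch using only the Python standard library. It assumes the conventional IQR rule (thresholds at 1.5 × IQR beyond Q1 and Q3) and linearly interpolated quartiles; the function name `iqr_bounds` is hypothetical, and the exact quartile method used by the metric is not stated in this page.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier thresholds from the IQR rule (hypothetical helper)."""
    # Assumption: quartiles use linear interpolation ("inclusive" method).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


real = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = iqr_bounds(real)  # Q1 = 3.25, Q3 = 7.75, IQR = 4.5

left_outliers = [x for x in real if x < lower]
right_outliers = [x for x in real if x > upper]
```

For this sample the thresholds are -3.5 and 14.5, so the only outlier is the value 100 on the right side.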

The metric then applies the same outlier thresholds, computed from the real data, to find outliers in the synthetic data (S). Finally, it compares the proportion of data points in the outlier ranges between the real data (R) and the synthetic data (S) to return a final score.

\text{score} = \min\left(\frac{p_S}{p_R}, 1\right), \quad \text{where } p = \frac{\text{\# outlier points}}{\text{total \# data points}}
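The formula can be sketched end to end as follows. This is an approximation for illustration, not the SDV Enterprise implementation: the function name `outlier_coverage_sketch` is hypothetical, missing values are assumed to be removed already, and the quartile method and strict inequalities are assumptions.

```python
from statistics import quantiles


def outlier_coverage_sketch(real, synthetic, k=1.5):
    """Approximate min(p_S / p_R, 1) using IQR thresholds from the real data."""
    # Thresholds come from the real data only (assumed linear-interpolated quartiles).
    q1, _, q3 = quantiles(real, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    def outlier_fraction(values):
        # Proportion of points that fall outside the real-data thresholds.
        return sum(1 for v in values if v < lower or v > upper) / len(values)

    p_real = outlier_fraction(real)
    if p_real == 0:
        # The score is undefined when the real data has no outliers.
        raise ValueError("real data contains no outliers; score is undefined")
    return min(outlier_fraction(synthetic) / p_real, 1.0)


real = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]            # 1 of 10 points is an outlier
no_outliers = [3, 4, 5, 6, 7]                      # nothing beyond the thresholds
half_covered = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                2, 3, 4, 5, 6, 7, 8, 9, 10, 50]    # 1 of 20 points is an outlier
many_outliers = [5, 6, 7, 40, 50, 60]              # 3 of 6 points are outliers

score_none = outlier_coverage_sketch(real, no_outliers)
score_half = outlier_coverage_sketch(real, half_covered)
score_many = outlier_coverage_sketch(real, many_outliers)
```

In this sketch the three synthetic columns score 0.0, 0.5, and 1.0 respectively. The last case shows the cap: once the synthetic data has at least the same proportion of outliers as the real data, extra outliers do not raise the score further.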

Usage

To apply this metric, access the single_column module and use the compute method.

from sdmetrics.single_column import OutlierCoverage

OutlierCoverage.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

  • (required) real_data: A pandas.Series object with the column of real data

  • (required) synthetic_data: A pandas.Series object with the column of synthetic data

FAQs

Technical Note: What is captured by this metric?

The OutlierCoverage score describes whether the synthetic data generally has data points in the outlier regions. But it does not tell us anything about the shape of the synthetic data. In the example below, the OutlierCoverage score is 1.0 because the synthetic data has plenty of data points in the outlier regions (red). However, the synthetic data is not the same shape as the real data.

In this case, the synthetic data is smoother than the real data, which is why there are many data points in the outlier regions. This can be beneficial for certain uses. To quantify this pattern, see the SmoothnessSimilarity metric.

References

[1] Interquartile Range, Outliers
