*SDV Enterprise Feature. This feature is available to our licensed users and is not currently in our public library. To learn more about SDV Enterprise and its extra features, get in touch with us.
This metric measures whether the synthetic data contains outliers that were present in the real data. It evaluates a failure mode where the synthetic data does not contain any outliers.
- Numerical: This metric is meant for numerical data
- Datetime: This metric converts datetime values into numerical values
This metric ignores missing values.
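As a rough illustration of the two data-handling rules above, the sketch below drops missing values and turns datetimes into numbers with pandas. The helper name to_numeric_without_missing is hypothetical, and representing datetimes as seconds since the Unix epoch is an assumption; the source does not say which numerical representation the metric uses internally.

```python
import pandas as pd


def to_numeric_without_missing(column: pd.Series) -> pd.Series:
    """Hypothetical helper: drop missing values, then make the column numeric."""
    column = column.dropna()  # missing values (NaN/NaT) are ignored
    if pd.api.types.is_datetime64_any_dtype(column):
        # ASSUMPTION: represent each (timezone-naive) timestamp as
        # seconds elapsed since the Unix epoch.
        return (column - pd.Timestamp("1970-01-01")).dt.total_seconds()
    return column.astype(float)


dates = pd.Series(pd.to_datetime(["2021-01-01", None, "2021-01-03"]))
numeric_dates = to_numeric_without_missing(dates)

numbers = pd.Series([1.0, None, 3.0])
numeric_numbers = to_numeric_without_missing(numbers)
```

After this step both columns are plain numerical Series with the missing entries removed, so the same outlier logic can be applied to either type.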
This metric is designed for data that contains outliers. We assume that the real data contains outliers, or else the metric is undefined.
(best) 1.0: The synthetic data fully covers the outlier regions that are in the real data
(worst) 0.0: The synthetic data does not contain any outliers
This metric first finds outliers in the real data (R) using the interquartile range (IQR), where IQR = Q3 − Q1. Any value lower than Q1 − 1.5×IQR is considered a left outlier and any value higher than Q3 + 1.5×IQR is considered a right outlier.
In this example, we're computing the IQR for a distribution of real data (black). The quartiles are shown in a box plot underneath the distribution. Areas that are below Q1 − 1.5×IQR and above Q3 + 1.5×IQR are considered outliers, as shown in the red boxes.
The metric uses the computed IQR to find outliers in the synthetic data (S). It then compares the proportion of data points in the outlier ranges between the real data (R) and synthetic data (S) to return a final score.
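The steps above can be sketched in a few lines of pandas. This is a minimal illustration, not the SDV Enterprise implementation: the function names are hypothetical, and the final step (the ratio of synthetic to real outlier proportions, capped at 1.0) is an assumption chosen to match the stated score range, since the exact formula is not given here.

```python
import pandas as pd


def iqr_outlier_bounds(real: pd.Series):
    """Return the (lower, upper) outlier thresholds computed from the real data."""
    q1, q3 = real.quantile(0.25), real.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outlier_proportion(data: pd.Series, lower: float, upper: float) -> float:
    """Fraction of non-missing values that fall outside [lower, upper]."""
    data = data.dropna()
    return float(((data < lower) | (data > upper)).mean())


def sketch_outlier_coverage(real: pd.Series, synthetic: pd.Series) -> float:
    """Hypothetical scoring sketch based on the description above."""
    lower, upper = iqr_outlier_bounds(real.dropna())
    real_prop = outlier_proportion(real, lower, upper)
    if real_prop == 0:
        return float("nan")  # undefined when the real data has no outliers
    synthetic_prop = outlier_proportion(synthetic, lower, upper)
    # ASSUMPTION: compare the proportions as a ratio capped at 1.0.
    return min(1.0, synthetic_prop / real_prop)


real = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
no_outliers = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
with_outliers = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 60])

score_without = sketch_outlier_coverage(real, no_outliers)
score_with = sketch_outlier_coverage(real, with_outliers)
```

Note that the thresholds come from the real data only; the synthetic data is evaluated against those same thresholds rather than against its own quartiles.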
To apply this metric, access the single_column module and use the OutlierCoverage class.
from sdmetrics.single_column import OutlierCoverage
real_data: A pandas.Series object with the column of real data
synthetic_data: A pandas.Series object with the column of synthetic data
Technical Note: What is captured by this metric?
The OutlierCoverage score describes whether the synthetic data generally has data points in the outlier regions. But it does not tell us anything about the shape of the synthetic data. In the example below, the OutlierCoverage score is 1.0 because the synthetic data has plenty of data points in the outlier regions (red). However, the synthetic data is not the same shape as the real data.
In this case, the synthetic data is smoother than the real data, which is why there are many data points in the outlier regions. This can be beneficial for certain uses. To quantify this pattern, see the SmoothnessSimilarity metric.
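To make this note concrete, the following sketch builds a synthetic column that is deliberately flatter and wider than the real one. The data is simulated with NumPy, and the choice of normal distributions and the helper name outlier_share are illustrative assumptions only. The synthetic column places many points in the real data's outlier regions even though its shape differs.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=5000)
synthetic = rng.normal(loc=0.0, scale=2.0, size=5000)  # flatter, wider shape

# Outlier thresholds are derived from the real data only.
q1, q3 = np.percentile(real, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outlier_share(values: np.ndarray) -> float:
    """Fraction of values that fall in the real data's outlier regions."""
    return float(np.mean((values < lower) | (values > upper)))


real_share = outlier_share(real)
synthetic_share = outlier_share(synthetic)
print(f"real: {real_share:.3f}, synthetic: {synthetic_share:.3f}")
```

Here the synthetic column has a larger share of points in the outlier regions than the real column, so an outlier-based score would look favorable while the spread of the two columns is clearly different.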