This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes -- aka the marginal distribution or 1D histogram of the column.
- Numerical: This metric is meant for continuous, numerical data
- Datetime: This metric converts datetime values into numerical values
This metric ignores missing values.
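The two preprocessing rules above (datetimes become numbers, missing values are dropped) can be sketched in plain Python. This is only an illustration: the helper name `to_numeric_without_missing` is hypothetical, and converting datetimes to seconds since the epoch is an assumption, since the text does not say which numerical representation SDMetrics uses internally.

```python
import math
from datetime import datetime


def to_numeric_without_missing(values):
    """Return the values as floats, skipping missing entries.

    Hypothetical helper: None and NaN are treated as missing, and
    datetimes are converted to seconds since the epoch (an assumed
    representation, not necessarily the one SDMetrics uses).
    """
    cleaned = []
    for value in values:
        if value is None:
            continue  # missing value: ignored
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN also counts as missing
        if isinstance(value, datetime):
            value = value.timestamp()  # datetime -> numeric
        cleaned.append(float(value))
    return cleaned
```

Because missing values are removed before any comparison, a column with many gaps is scored only on the values that are present.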
(best) 1.0: The real data is exactly the same as the synthetic data
(worst) 0.0: The real and synthetic data are as different as they can be
The graphs below show two examples with real and synthetic data (black and green). At the left, the synthetic data is similar to the real data so the score is close to 1. At the right, the shapes are different so the score is lower.
The KSComplement uses the Kolmogorov-Smirnov statistic. To compute this statistic, we convert a numerical distribution into its cumulative distribution function (CDF). The KS statistic is the maximum difference between the two CDFs, as shown below.
The distance is a value between 0 and 1. In SDMetrics, we invert the statistic: the KSComplement returns 1 - (KS statistic) so that a higher score means higher quality.
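To make the computation concrete, here is a minimal pure-Python sketch of the idea: build the empirical CDF of each sample, take the largest gap between them, and subtract it from 1. The function names `ks_statistic` and `ks_complement` are illustrative, and this is not the SDMetrics implementation, which may differ in details.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Maximum absolute difference between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    if not real_sorted or not synth_sorted:
        raise ValueError("both samples must be non-empty")

    # Empirical CDFs are step functions, so the gap between them can
    # only change at values that occur in one of the samples.
    points = sorted(set(real_sorted) | set(synth_sorted))

    max_gap = 0.0
    for x in points:
        # Fraction of each sample that is <= x.
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def ks_complement(real, synthetic):
    """Invert the KS statistic so that a higher score means higher quality."""
    return 1.0 - ks_statistic(real, synthetic)
```

With this sketch, identical samples score 1.0, and samples whose value ranges do not overlap at all score 0.0, matching the best and worst cases described earlier.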
Recommended Usage: The Quality Report applies this metric to every compatible column and provides visualizations to help you understand the score.
To manually run this metric, access the single_column module and use the KSComplement metric:
from sdmetrics.single_column import KSComplement
real_data: A pandas.Series containing a single column
synthetic_data: A similar pandas.Series object with the synthetic version of the column
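A minimal sketch of a manual run, using the two parameters listed above. It assumes that KSComplement exposes a `compute` class method accepting `real_data` and `synthetic_data` as keyword arguments, which the text above does not state; check the installed SDMetrics version if the call fails. The wrapper name `ks_complement_score` is hypothetical, and the imports are deferred into the function so the helper can be defined even where the packages are not installed.

```python
def ks_complement_score(real_values, synthetic_values):
    """Score one column with KSComplement (requires pandas and sdmetrics)."""
    # Deferred imports: only needed when the helper is actually called.
    import pandas as pd
    from sdmetrics.single_column import KSComplement

    # Assumption: `compute` is the entry point and takes these two
    # keyword arguments, each a pandas.Series holding a single column.
    return KSComplement.compute(
        real_data=pd.Series(real_values),
        synthetic_data=pd.Series(synthetic_values),
    )
```

For example, `ks_complement_score(real_table['age'], synthetic_table['age'])`, where `real_table`, `synthetic_table`, and the `age` column are placeholder names, should return a score between 0.0 and 1.0.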