
KSComplement

This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes, i.e. the marginal distribution or 1D histogram of the column.

Data Compatibility

  • Numerical: This metric is meant for continuous, numerical data
  • Datetime: This metric converts datetime values into numerical values
This metric ignores missing values.
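
The two preprocessing rules above (datetime values become numbers, missing values are dropped) can be pictured with a small pandas sketch. The helper name `to_numeric_without_missing` is hypothetical and is not part of SDMetrics; how the library performs the conversion internally is an assumption here.

```python
import pandas as pd


def to_numeric_without_missing(column: pd.Series) -> pd.Series:
    """Hypothetical helper: drop missing values, then make datetimes numeric."""
    # Missing values (NaN / NaT) are ignored by the metric, so remove them first.
    column = column.dropna()
    if pd.api.types.is_datetime64_any_dtype(column):
        # Represent each timestamp as an integer offset from the Unix epoch.
        # The time unit depends on the column's datetime resolution.
        column = column.astype("int64")
    return column


dates = pd.Series(pd.to_datetime(["2021-01-01", None, "2021-01-02"]))
numeric_dates = to_numeric_without_missing(dates)
```

After this step, later dates map to larger integers, so the ordering that the column shape depends on is preserved.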

Score

(best) 1.0: The real data is exactly the same as the synthetic data
(worst) 0.0: The real and synthetic data are as different as they can be
The graphs below show two examples with real and synthetic data (black and green). At the left, the synthetic data is similar to the real data so the score is close to 1. At the right, the shapes are different so the score is lower.

How does it work?

The KSComplement uses the Kolmogorov-Smirnov statistic [1]. To compute this statistic, we convert a numerical distribution into its cumulative distribution function (CDF) [2]. The KS statistic is the maximum difference between the two CDFs, as shown below.
The KS statistic is a value between 0 and 1. In SDMetrics, we invert the statistic: the KSComplement returns 1 - (KS statistic) so that a higher score means higher quality.
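
To make the computation concrete, here is a minimal NumPy sketch of the idea described above: build the empirical CDF of each column, take the largest gap between them, and return 1 minus that gap. The function name `ks_complement_sketch` is hypothetical. This illustrates the formula only and is not the SDMetrics implementation.

```python
import numpy as np


def ks_complement_sketch(real, synthetic):
    """Illustrative only: 1 - (max gap between the two empirical CDFs)."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)

    # Ignore missing values, then sort so searchsorted can count efficiently.
    # Assumes each column still has at least one value after this step.
    real = np.sort(real[~np.isnan(real)])
    synthetic = np.sort(synthetic[~np.isnan(synthetic)])

    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that appears in either column.
    points = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, points, side="right") / real.size
    cdf_synthetic = np.searchsorted(synthetic, points, side="right") / synthetic.size

    ks_statistic = np.max(np.abs(cdf_real - cdf_synthetic))
    return 1.0 - ks_statistic


identical = ks_complement_sketch([1, 2, 3, 4], [1, 2, 3, 4])
shifted = ks_complement_sketch([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_complement_sketch([1, 2, 3], [10, 11, 12])
```

In this sketch, identical columns score 1.0, columns with no overlap score 0.0, and the partially overlapping pair lands in between.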

Usage

Recommended Usage: The Quality Report applies this metric to every compatible column and provides visualizations to understand the score.
To manually run this metric, access the single_column module and use the compute method.
from sdmetrics.single_column import KSComplement

KSComplement.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)
Parameters
  • (required) real_data: A pandas.Series containing a single column
  • (required) synthetic_data: A similar pandas.Series object with the synthetic version of the column

FAQs

For categorical and boolean columns, use the TVComplement metric as the counterpart to this metric.

References