# KSComplement

This metric computes the similarity between a real column and a synthetic column in terms of column shape, that is, the *marginal distribution* or 1D histogram of the column.

## Data Compatibility

**Numerical**: This metric is meant for continuous, numerical data

**Datetime**: This metric converts datetime values into numerical values

This metric ignores missing values.
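The preprocessing described above can be illustrated with a minimal sketch. It is not SDMetrics' own code: the helper name `to_numeric_values` is hypothetical, and converting datetimes to seconds since the Unix epoch is an assumption made for illustration, since this page does not say which numerical representation the library uses.

```python
import math
from datetime import datetime, timezone


def to_numeric_values(values):
    """Drop missing values and turn datetimes into plain floats.

    Hypothetical helper for illustration only. Datetimes are mapped to
    seconds since the Unix epoch, which is an assumed representation.
    """
    cleaned = []
    for value in values:
        if value is None:
            continue  # missing value: ignored
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN also counts as missing
        if isinstance(value, datetime):
            value = value.timestamp()  # datetime -> numerical value
        cleaned.append(float(value))
    return cleaned


numbers = to_numeric_values([1, None, float("nan"), 2.5])
one_second = to_numeric_values([datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)])
```

Timezone-aware datetimes are used in the example so the conversion does not depend on the machine's local timezone.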

## Score

**(best) 1.0**: The real data is exactly the same as the synthetic data

**(worst) 0.0**: The real and synthetic data are as different as they can be

The graphs below show two examples with real and synthetic data (black and green). At the left, the synthetic data is similar to the real data so the score is close to 1. At the right, the shapes are different so the score is lower.

## How does it work?

The KSComplement uses the Kolmogorov-Smirnov statistic [1]. To compute this statistic, we convert each numerical distribution (real and synthetic) into its cumulative distribution function (CDF) [2]. The KS statistic is the maximum absolute difference between the two CDFs, as shown below.

The distance is a value between 0 and 1. In SDMetrics, we invert the statistic: The KSComplement returns `1-(KS statistic)` so that a higher score means higher quality.
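To make the computation concrete, here is a minimal pure-Python sketch of the two-sample KS statistic and its complement. It is an illustration of the idea described above, not the SDMetrics implementation, and the function name `ks_complement` is hypothetical. It assumes both inputs are non-empty lists of numbers with missing values already removed.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - (two-sample KS statistic) for two lists of numbers."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    if not real_sorted or not synth_sorted:
        raise ValueError("both samples need at least one value")

    ks_statistic = 0.0
    # Empirical CDFs are step functions that only change at observed values,
    # so the largest gap between them occurs at one of those values.
    for x in set(real_sorted) | set(synth_sorted):
        # Fraction of each sample that is <= x
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        ks_statistic = max(ks_statistic, abs(cdf_real - cdf_synth))

    # Invert so that a higher score means more similar distributions
    return 1.0 - ks_statistic


identical = ks_complement([1, 2, 3, 4], [1, 2, 3, 4])   # 1.0
shifted = ks_complement([1, 2, 3, 4], [3, 4, 5, 6])     # 0.5
disjoint = ks_complement([1, 2], [10, 20])              # 0.0
```

In the shifted example, the largest gap between the two CDFs is 0.5 (at the values 3 and 4), so the score is 0.5. For real workloads, a library routine such as `scipy.stats.ks_2samp` computes the same statistic more efficiently.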

## Usage

**Recommended Usage:** The Quality Report applies this metric to every compatible column and provides visualizations to help you understand the score.

To manually run this metric, access the `single_column` module and use the `compute` method.

**Parameters**

(required) `real_data`: A pandas.Series containing a single column

(required) `synthetic_data`: A similar pandas.Series object with the synthetic version of the column

## FAQs

## References