CSTest
This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes. You can think of the shape as what you observe when you plot a bar graph of the column.
Data Compatibility
Categorical: This metric is meant for discrete, categorical data
Boolean: This metric works on boolean data
This metric does not accept missing values
Score
(best) 1.0: The p-value is high, indicating that the synthetic data is not very different from the real data
(worst) 0.0: The p-value is low, indicating that the synthetic data is significantly different from the real data
How does it work?
This test normalizes the real and synthetic data in order to compute the category frequencies. Then, it applies the Chi-squared test [1] to test the null hypothesis that the synthetic data comes from the same distribution as the real data.
The test returns the p-value [2]. A smaller p-value indicates that the synthetic data is significantly different from the real data, so the null hypothesis is rejected and the overall score is worse.
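The steps above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the helper name cstest_sketch is hypothetical, it assumes pandas and SciPy are installed, and it assumes the synthetic column contains no categories that are missing from the real column. The single-category shortcut is also an assumption made for the sketch.

```python
import pandas as pd
from scipy.stats import chisquare


def cstest_sketch(real: pd.Series, synthetic: pd.Series) -> float:
    """Hypothetical helper: p-value of a Chi-squared test on category frequencies."""
    # Normalize counts to frequencies, aligned on the real column's categories.
    real_freq = real.value_counts(normalize=True)
    synth_freq = synthetic.value_counts(normalize=True).reindex(
        real_freq.index, fill_value=0.0
    )

    # Assumption: with a single category there are zero degrees of freedom,
    # so the sketch treats the two shapes as identical.
    if len(real_freq) == 1:
        return 1.0

    # Observed = synthetic frequencies, expected = real frequencies.
    _, p_value = chisquare(f_obs=synth_freq.to_numpy(), f_exp=real_freq.to_numpy())
    return float(p_value)


real = pd.Series(["a", "b", "a", "b"])
same_shape = pd.Series(["a", "b", "b", "a", "a", "b"])  # same 50/50 split
skewed = pd.Series(["a"] * 9 + ["b"])                   # 90/10 split

print(cstest_sketch(real, same_shape))  # identical frequencies give the best score
print(cstest_sketch(real, skewed))      # differing frequencies give a lower score
```

Because the test runs on normalized frequencies rather than raw counts, the two columns in this sketch do not need to have the same number of rows.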
Usage
Access this metric from the single_column module and use the compute method.
Parameters
(required) real_data: A pandas.Series containing a single column
(required) synthetic_data: A similar pandas.Series object with the synthetic version of the column
References