StatisticSimilarity
This metric measures the similarity between a real column and a synthetic column by comparing a summary statistic. Supported summary statistics are: mean, median and standard deviation.
Data Compatibility
Numerical: This metric is meant for continuous, numerical data.
Datetime: This metric converts datetime values into numerical values.
This metric ignores missing values.
Score
(best) 1.0: The statistic for the real data is exactly the same as the statistic for the synthetic data
(worst) 0.0: The statistic for the real data is extremely different from the synthetic data
How does it work?
This test computes the given statistical function, f, for the real and synthetic columns, r and s. Then, the test normalizes the difference between f(r) and f(s) by scaling it and taking its complement. This creates a score that falls within the [0, 1] range*, where a high value means high similarity.
The supported statistical functions (f) are: the (arithmetic) mean, median and standard deviation.
*In rare cases, where the synthetic data statistic is very different from the real data, the computed score may be negative. In such cases we clip the score to 0, the worst possible score.
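The computation described above can be sketched in plain Python. This is an illustrative sketch, not the library's own code: the helper name `statistic_similarity` is hypothetical, and the scale factor (the range of the real column, max(r) - min(r)) is an assumption, since the text only says the score is normalized "by scaling". The sketch also assumes the real column is not constant, so the range is non-zero.

```python
import statistics

# Hypothetical mapping, not the library's code: each supported statistic
# name points to a standard-library function.
STATISTICS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "std": statistics.stdev,  # sample standard deviation
}


def statistic_similarity(real, synthetic, statistic="mean"):
    """Return a score in [0, 1]; 1.0 means the two statistics are identical."""
    # Ignore missing values, represented here as None.
    r = [x for x in real if x is not None]
    s = [x for x in synthetic if x is not None]

    f = STATISTICS[statistic]
    # Assumption: the difference is scaled by the range of the real column.
    scale = max(r) - min(r)
    score = 1 - abs(f(r) - f(s)) / scale
    # Clip negative scores to 0, the worst possible score.
    return max(score, 0.0)


real = [1.0, 2.0, None, 3.0, 4.0]
synthetic = [2.0, 3.0, 4.0, 5.0]
score = statistic_similarity(real, synthetic, statistic="mean")
```

Here the real mean is 2.5, the synthetic mean is 3.5 and the real range is 3, so the score is 1 - 1/3, or about 0.67. If the synthetic mean were far outside the real range, the raw value would be negative and the final line would clip it to 0.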
Usage
Access this metric from the single_column module and use the compute method.
Parameters
(required) real_data: A pandas.Series object with the column of real data
(required) synthetic_data: A pandas.Series object with the column of synthetic data
statistic: A string describing the name of the statistical function
    (default) 'mean': The arithmetic mean
    'median': The median value
    'std': The standard deviation