StatisticMSAS
This metric is for sequential data. It measures the similarity between a set of real sequences and a set of synthetic sequences by computing a summary statistic for each sequence. Supported summary statistics are: mean, median, standard deviation, min, and max.
ID: This metric requires a column of ID values that distinguish between different sequences
Numerical: This metric computes statistics for a continuous, numerical data column
Both columns need to be present for this metric. This metric ignores missing values.
(best) 1.0: The statistic for the real data is exactly the same as the synthetic data
(worst) 0.0: The statistic for the real data is extremely different from the synthetic data
This metric assumes you have an ID column to represent sequences. For example, if you are storing patient health information, the Patient ID column represents the sequence ID. You can then compare a statistic across any numerical column, such as Systolic BP.
To compute a score, this metric implements the Multi Sequence Aggregate Similarity approach from [1].
It breaks up the real column of numerical values based on the sequence ID and computes a statistic value for each sequence. This yields a distribution of statistics, D_r.
It repeats the process for the synthetic column of numerical values, yielding a separate distribution, D_s.
This metric will then compare the two distributions using the KSComplement metric.
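The steps above can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: the helper names (per_sequence_statistic, ks_complement, msas_score) and the sample data are hypothetical, the statistic is passed as a callable instead of a string, and the comparison is assumed to be 1 minus the two-sample Kolmogorov-Smirnov statistic.

```python
import bisect
import math
import statistics
from collections import defaultdict


def per_sequence_statistic(ids, values, func):
    """Group values by sequence ID and apply func to each group.

    Missing values (None or NaN) are skipped, as the metric ignores them.
    """
    groups = defaultdict(list)
    for seq_id, value in zip(ids, values):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        groups[seq_id].append(value)
    return [func(group) for group in groups.values()]


def ks_complement(sample_a, sample_b):
    """Return 1 minus the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return 1.0 - max_gap


def msas_score(real_data, synthetic_data, func):
    """Hypothetical end-to-end sketch: each argument is an (ids, values) pair."""
    d_r = per_sequence_statistic(real_data[0], real_data[1], func)
    d_s = per_sequence_statistic(synthetic_data[0], synthetic_data[1], func)
    return ks_complement(d_r, d_s)


# Made-up blood pressure readings for three real and three synthetic patients.
real_ids = ["A", "A", "A", "B", "B", "C", "C"]
real_bp = [120, 124, 118, 135, 131, 110, 114]
synth_ids = ["X", "X", "Y", "Y", "Z", "Z"]
synth_bp = [121, 119, 134, 130, 111, 115]

score = msas_score((real_ids, real_bp), (synth_ids, synth_bp), statistics.fmean)
print(score)
```

Because the comparison is made between the two distributions of per-sequence statistics, the real and synthetic IDs do not need to match, and the two datasets may contain different numbers of sequences.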
Access this metric from the column_pairs module and use the compute method.
Parameters
(required) real_data: A tuple of two pandas.Series objects holding the real data. The first Series contains the sequence IDs, while the second contains the numerical column.
(required) synthetic_data: A tuple of two pandas.Series objects holding the synthetic data. The first Series contains the sequence IDs, while the second contains the numerical column.
statistic: A string naming the statistical function to compute for each sequence
(default) 'mean': The arithmetic mean
'median': The median value
'std': The standard deviation
'min': The minimum value
'max': The maximum value