StatisticMSAS

This metric is for sequential data. It measures the similarity between a set of real sequences and a set of synthetic sequences by comparing a summary statistic computed for each sequence. Supported summary statistics are: mean, median, standard deviation, min, and max.

Data Compatibility

ID: This metric requires a column of ID values that distinguish between different sequences
Numerical: This metric computes statistics for a continuous, numerical data column

Both columns need to be present for this metric. This metric ignores missing values.

Score

(best) 1.0: The statistic for the real data is exactly the same as the synthetic data

(worst) 0.0: The statistic for the real data is extremely different from the synthetic data

How does it work?

This metric assumes you have an ID column to represent sequences. For example, if you are storing patient health information, the Patient ID column represents the sequence ID. You can then compare a statistic across any numerical column, such as Systolic BP.

To compute a score, this metric implements the Multi Sequence Aggregate Similarity approach from [1].

It breaks up the real column of numerical values based on the sequence ID and computes a statistic value for each sequence. This yields a distribution of statistics, D_r.
It repeats the process for the synthetic column of numerical values, yielding a separate distribution, D_s.
This metric will then compare the two distributions using the KSComplement metric.

score = KSComplement(D_r, D_s)
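The steps above can be sketched in a few lines of pandas and SciPy. This is a minimal illustration, not the library's own implementation: the function name msas_sketch and the example tables are hypothetical, and it assumes that KSComplement is 1 minus the two-sample Kolmogorov-Smirnov statistic.

```python
import pandas as pd
from scipy.stats import ks_2samp


def msas_sketch(real_data, synthetic_data, statistic="mean"):
    """Illustrative only; not the library's own implementation."""
    real_ids, real_values = real_data
    synthetic_ids, synthetic_values = synthetic_data

    # One statistic per sequence. pandas skips missing values here by default.
    d_r = real_values.groupby(real_ids.to_numpy()).agg(statistic).dropna()
    d_s = synthetic_values.groupby(synthetic_ids.to_numpy()).agg(statistic).dropna()

    # Assumption: KSComplement is 1 minus the two-sample KS statistic.
    return 1.0 - ks_2samp(d_r, d_s).statistic


# Hypothetical example data with two sequences in each table.
real_table = pd.DataFrame({
    "Patient ID": ["A", "A", "B", "B"],
    "Systolic BP": [120.0, 130.0, 110.0, 114.0],
})
synthetic_table = pd.DataFrame({
    "Patient ID": ["X", "X", "Y", "Y"],
    "Systolic BP": [121.0, 129.0, 111.0, 113.0],
})

score = msas_sketch(
    (real_table["Patient ID"], real_table["Systolic BP"]),
    (synthetic_table["Patient ID"], synthetic_table["Systolic BP"]),
    statistic="mean",
)
```

In this example the per-sequence means are identical in both tables (125 and 112), so D_r and D_s coincide and the sketch returns the best possible score.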

Usage

Access this metric from the column_pairs module and use the compute method.

from sdmetrics.column_pairs import StatisticMSAS

StatisticMSAS.compute(
    real_data=(real_table['Patient ID'], real_table['Systolic BP']),
    synthetic_data=(synthetic_table['Patient ID'], synthetic_table['Systolic BP']),
    statistic='mean'
)

Parameters

(required) real_data: A tuple of two pandas.Series objects containing the real data. The first Series holds the sequence IDs, and the second holds the numerical column.
(required) synthetic_data: A tuple of two pandas.Series objects containing the synthetic data. The first Series holds the sequence IDs, and the second holds the numerical column.
statistic: A string describing the name of the statistical function
- (default) 'mean': The arithmetic mean
- 'median': The median value
- 'std': The standard deviation
- 'min': The min value
- 'max': The max value

FAQs

Do the ID values have to match up between the real and synthetic data?

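No, the ID values are not expected to be the same between the real and synthetic data because they represent entirely different entities. The IDs are only used to split each dataset into sequences. The metric then compares the overall distribution of per-sequence statistics rather than matching individual sequences.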
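As a small illustration of why the labels themselves do not matter, the sketch below groups the same values under two unrelated sets of IDs and compares the resulting per-sequence means. The data and variable names are hypothetical.

```python
import pandas as pd

values = pd.Series([120.0, 130.0, 110.0, 114.0])

# Two unrelated labelings of the same two sequences.
real_ids = pd.Series(["P-001", "P-001", "P-002", "P-002"])
synthetic_ids = pd.Series([9001, 9001, 9002, 9002])

# Only the grouping matters; the ID labels never enter the statistics.
real_stats = sorted(values.groupby(real_ids.to_numpy()).mean())
synthetic_stats = sorted(values.groupby(synthetic_ids.to_numpy()).mean())
```

Both lists contain the same per-sequence means, so any comparison of the two distributions is unaffected by how the sequences are named.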

Is it better to use the mean or the median statistic?

The mean and median summarize the values differently, especially when your data has a skew [2].

The mean takes all values into account. It may be significantly affected by just a few large or small values. You can use the mean for a fast computation if you know there is no skew in your data or if you are ok with outliers affecting the score.
The median finds a middle value where 50% of the data is larger and 50% is smaller. The median is resilient to outliers in either direction. This may be desirable if you have skewed data, as the score is more representative of the typical values. Note that this computation takes longer.
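To make the difference concrete, the sketch below computes both statistics per sequence for a small, hypothetical dataset in which one sequence contains a single extreme reading.

```python
import pandas as pd

ids = pd.Series(["A", "A", "A", "B", "B", "B"], name="Patient ID")
# Sequence B contains one extreme value (300.0).
values = pd.Series([118.0, 120.0, 122.0, 119.0, 121.0, 300.0], name="Systolic BP")

per_sequence_mean = values.groupby(ids.to_numpy()).mean()
per_sequence_median = values.groupby(ids.to_numpy()).median()

# Sequence A: mean 120.0, median 120.0
# Sequence B: mean 180.0, median 121.0
```

The single outlier pulls the mean of sequence B far above its typical readings, while the median stays close to them. Which behavior is preferable depends on whether outliers should influence the score.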

References

[1] Sequential Models in the Synthetic Data Vault

[2] https://en.wikipedia.org/wiki/Skewness

