InterRowMSAS

This metric is for sequential data. It measures the similarity between a set of real sequences and a set of synthetic sequences by comparing the differences between consecutive rows.

Data Compatibility

ID: This metric requires a column of ID values that distinguish between different sequences
Numerical: This metric computes differences for a continuous, numerical data column

Both columns need to be present for this metric. This metric ignores missing values.

Score

(best) 1.0: The differences between rows are exactly the same for the real and synthetic data

(worst) 0.0: The differences between rows are as different as they can be

How does it work?

This metric assumes you have an ID column to represent sequences. For example, if you are storing patient health information, the Patient ID column represents the sequence ID. You can then compare the inter-row differences for numerical columns such as Systolic BP.

To compute a score, this metric implements the Multi Sequence Aggregate Similarity approach from [1].

It breaks up the real column of numerical values based on the sequence ID and computes the inter-row differences within each sequence. It then averages the differences of each sequence into a single value; the values for all sequences form a distribution, D_r.
It repeats the process for the synthetic column of numerical values, yielding a separate distribution, D_s.
The metric then compares the two distributions using the KSComplement metric.

score = KSComplement(D_r, D_s)
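The steps above can be sketched in a few lines of plain Python. This is only an illustration of the described procedure, not the library's implementation: the helper names `per_sequence_mean_diffs` and `ks_complement` are hypothetical, the sample blood-pressure values are made up, and skipping sequences with a single row is an assumption. The KS complement is taken here as 1 minus the two-sample Kolmogorov-Smirnov statistic.

```python
from collections import defaultdict
from statistics import mean


def per_sequence_mean_diffs(ids, values):
    """Return the average inter-row difference of each sequence (hypothetical helper)."""
    sequences = defaultdict(list)
    for seq_id, value in zip(ids, values):
        if value is not None:  # the metric ignores missing values
            sequences[seq_id].append(value)
    means = []
    for seq in sequences.values():
        diffs = [later - earlier for earlier, later in zip(seq, seq[1:])]
        if diffs:  # assumption: a single-row sequence contributes nothing
            means.append(mean(diffs))
    return means


def ks_complement(sample_a, sample_b):
    """Return 1 minus the two-sample KS statistic (hypothetical helper)."""
    def cdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    ks_statistic = max(abs(cdf(sample_a, x) - cdf(sample_b, x)) for x in points)
    return 1 - ks_statistic


# Made-up data: two real patients and two synthetic patients.
real_ids = ['A', 'A', 'A', 'B', 'B', 'B']
real_bp = [120, 124, 128, 110, 111, 112]
synthetic_ids = ['X', 'X', 'X', 'Y', 'Y', 'Y']
synthetic_bp = [118, 122, 126, 130, 133, 136]

real_means = per_sequence_mean_diffs(real_ids, real_bp)                 # D_r
synthetic_means = per_sequence_mean_diffs(synthetic_ids, synthetic_bp)  # D_s
score = ks_complement(real_means, synthetic_means)
print(real_means, synthetic_means, score)
```

Note that the real and synthetic IDs in this sketch are unrelated. Only the per-sequence averages enter the comparison.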

Usage

Access this metric from the column_pairs module and use the compute method.

from sdmetrics.column_pairs import InterRowMSAS

InterRowMSAS.compute(
    real_data=(real_table['Patient ID'], real_table['Systolic BP']),
    synthetic_data=(synthetic_table['Patient ID'], synthetic_table['Systolic BP']),
    n_rows_diff=1
)

Parameters

(required) real_data: A tuple of two pandas.Series objects containing the real data. The first Series holds the sequence IDs, while the second holds the numerical column.
(required) synthetic_data: A tuple of two pandas.Series objects containing the synthetic data. The first Series holds the sequence IDs, while the second holds the numerical column.
n_rows_diff: An integer representing how many rows apart two values are when taking the difference
- (default) 1: Take the difference between a row and the one right before it
- Int > 0: Take the difference between row n and row n + n_rows_diff
apply_log: Whether to apply a natural log before taking the difference
- (default) False: Do not apply a log. This uses the raw difference between values (absolute rather than relative change), which is useful when you expect the data to grow or shrink linearly
- True: Apply a log before taking the difference. This is recommended when you expect the data to grow or shrink exponentially
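To make the effect of these two parameters concrete, here is a minimal sketch of the difference step on its own. The helper `inter_row_diffs` is hypothetical and only mirrors the parameter descriptions above. It assumes strictly positive values when `apply_log` is set and does not show how the library treats zeros or negatives.

```python
import math


def inter_row_diffs(values, n_rows_diff=1, apply_log=False):
    """Differences between rows that are n_rows_diff apart (hypothetical helper)."""
    if apply_log:
        # Natural log first, so equal ratios become equal differences.
        values = [math.log(v) for v in values]
    return [later - earlier for earlier, later in zip(values, values[n_rows_diff:])]


linear = [10, 20, 30, 40]       # grows by a constant amount
exponential = [10, 20, 40, 80]  # grows by a constant factor

step_one = inter_row_diffs(linear)                 # consecutive rows
step_two = inter_row_diffs(linear, n_rows_diff=2)  # rows two apart
raw_exp = inter_row_diffs(exponential)             # differences keep growing
log_exp = inter_row_diffs(exponential, apply_log=True)  # constant, each is log(2)
print(step_one, step_two, raw_exp, log_exp)
```

With the log applied, the exponentially growing column produces a constant difference, which is why that setting suits data that grows or shrinks by a factor rather than by an amount.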

FAQs

This metric is in Beta. Be careful when using the metric and interpreting its score.

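Taking the difference between subsequent rows and averaging the results effectively cancels out all terms besides the first and last: the differences within a sequence sum to (last value - first value), so the average depends only on the endpoints and the sequence length. The team is considering alternative implementations: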

Do not average out the differences between each sequence. Instead, add the differences to an overall distribution D_r or D_s.
(Similar to taking a log) Apply a transform to each number. E.g. squaring all values and taking the square root of the differences -- sqrt((r+x)**2 - (r)**2)

The team is still identifying the pros and cons of each approach.
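The cancellation is easy to verify by hand. In the sketch below (made-up values, hypothetical helper `diffs`), two sequences share the same first and last value but follow very different paths. Their average inter-row difference is identical, while the raw differences, which the first alternative would pool into the distribution, still tell them apart.

```python
from statistics import mean


def diffs(seq):
    """Differences between consecutive values (hypothetical helper)."""
    return [later - earlier for earlier, later in zip(seq, seq[1:])]


smooth = [100, 105, 110, 115, 120]  # steady increase
jagged = [100, 140, 90, 150, 120]   # same endpoints, large swings

mean_smooth = mean(diffs(smooth))
mean_jagged = mean(diffs(jagged))

# The sum of differences telescopes to (last - first).
endpoint_only = (jagged[-1] - jagged[0]) / (len(jagged) - 1)

print(mean_smooth, mean_jagged, endpoint_only)
print(diffs(smooth), diffs(jagged))  # the un-averaged differences do differ
```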

Do the ID values have to match up between the real and synthetic data?

No, the ID values are not expected to be the same between the real and synthetic data because they represent entirely different entities. This metric is computing the overall statistics between the sequences.

References

[1] Sequential Models in the Synthetic Data Vault

