InterRowMSAS
This metric is for sequential data. It measures the similarity between a set of real sequences and a set of synthetic sequences by computing the differences between consecutive rows within each sequence.
ID: This metric requires a column of ID values that distinguish between different sequences.
Numerical: This metric computes differences for a continuous, numerical data column.
Both columns need to be present for this metric. This metric ignores missing values.
(best) 1.0: The differences between rows are exactly the same for the real and synthetic data
(worst) 0.0: The differences between rows are as different as they can be
This metric assumes you have an ID column to represent sequences. For example, if you are storing patient health information, the Patient ID column represents the sequence ID. You can then compare the differences for numerical columns such as Systolic BP.
To compute a score, this metric implements the Multi Sequence Aggregate Similarity approach from [1].
It breaks up the real column of numerical values based on the sequence ID and computes the inter-row differences for each sequence. It then averages the differences within each sequence, so that every sequence contributes one value. Together, these values form a distribution D_r.
It repeats the process for the synthetic column of numerical values, yielding a separate distribution, D_s.
This metric will then compare the two distributions using the KSComplement metric.
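The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names `mean_inter_row_diff` and `ks_complement` are hypothetical, the data values are made up, and KSComplement is assumed here to be one minus the two-sample Kolmogorov-Smirnov statistic.

```python
import numpy as np
import pandas as pd


def mean_inter_row_diff(ids, values, n_rows_diff=1):
    """Return one averaged inter-row difference per sequence (hypothetical helper)."""
    # Difference each row with the row n_rows_diff before it, within each sequence.
    # The first n_rows_diff rows of every sequence have no partner and become NaN.
    diffs = values.astype(float).groupby(ids).diff(periods=n_rows_diff)
    # Average per sequence; NaN entries are skipped, sequences with no valid
    # difference are dropped.
    return diffs.groupby(ids).mean().dropna()


def ks_complement(a, b):
    """Assumed definition: 1 minus the two-sample KS statistic."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    # Empirical CDFs of both samples evaluated at every observed point.
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return 1.0 - float(np.max(np.abs(cdf_a - cdf_b)))


ids = pd.Series(["A", "A", "A", "B", "B", "B"])
real = pd.Series([120, 122, 124, 130, 131, 132])       # mean differences: A=2, B=1
similar = pd.Series([118, 120, 122, 128, 129, 130])    # mean differences: A=2, B=1
different = pd.Series([100, 110, 120, 100, 120, 140])  # mean differences: A=10, B=20

d_r = mean_inter_row_diff(ids, real)
score_similar = ks_complement(d_r, mean_inter_row_diff(ids, similar))
score_different = ks_complement(d_r, mean_inter_row_diff(ids, different))
print(score_similar, score_different)
```

In this toy data the first synthetic column reproduces the per-sequence averages exactly, while the second has no overlap with them, so the two scores land at the opposite ends of the range.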
Access this metric from the column_pairs module and use the compute method.
Parameters
(required) real_data: A tuple of two pandas.Series objects containing the real data. The first Series holds the sequence ID, while the second holds the numerical column.
(required) synthetic_data: A tuple of two pandas.Series objects containing the synthetic data. The first Series holds the sequence ID, while the second holds the numerical column.
n_rows_diff: An integer representing the number of rows to consider when taking the difference
  (default) 1: Take the difference between a row and the one right before it
  Int > 0: Take the difference between row n and row n + n_rows_diff
apply_log: Whether to apply a natural log before taking the difference
  (default) False: Do not apply a log. This results in the absolute (non-log) difference, which is useful when you expect the data to grow or shrink linearly
  True: Apply a log before taking the difference. This is recommended when you expect the data to grow or shrink exponentially
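To make the effect of these two parameters concrete, here is a small standalone sketch. The helper `inter_row_diffs` is hypothetical and only mimics the described behavior on a single sequence; it is not part of the library.

```python
import math


def inter_row_diffs(values, n_rows_diff=1, apply_log=False):
    """Differences between row n and row n + n_rows_diff (hypothetical helper)."""
    if apply_log:
        # Natural log first; this assumes strictly positive values.
        values = [math.log(v) for v in values]
    return [later - earlier for earlier, later in zip(values, values[n_rows_diff:])]


doubling = [100.0, 200.0, 400.0, 800.0]  # grows exponentially

plain = inter_row_diffs(doubling)                  # 100.0, 200.0, 400.0
wider = inter_row_diffs(doubling, n_rows_diff=2)   # 300.0, 600.0
logged = inter_row_diffs(doubling, apply_log=True) # each roughly log(2)
print(plain, wider, logged)
```

With the log applied, a sequence that doubles at every step produces a constant difference, which is why the option is suggested for exponential growth or decay.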
This metric is in Beta. Be careful when using the metric and interpreting its score.
Taking the absolute difference between subsequent rows and averaging them out will effectively cancel out all terms besides the first and last, so each sequence's average depends only on where the sequence starts and ends. The team is considering alternative implementations:
Do not average out the differences within each sequence. Instead, add every difference to an overall distribution D_r or D_s.
(Similar to taking a log) Apply a transform to each number, e.g. squaring all values and taking the square root of the differences: sqrt((r+x)**2 - (r)**2)
The team is still identifying the pros and cons of each approach.
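The cancellation described in the Beta note can be checked with a few lines of plain Python. The sequences below are made up, and `mean_consecutive_diff` is a hypothetical helper that mirrors the default setting (a difference of one row, no log).

```python
def mean_consecutive_diff(seq):
    """Average of the differences between each row and the one before it."""
    diffs = [later - earlier for earlier, later in zip(seq, seq[1:])]
    return sum(diffs) / len(diffs)


smooth = [1.0, 1.5, 2.0, 2.5, 3.0]  # steady increase
jagged = [1.0, 5.0, 2.0, 9.0, 3.0]  # large swings, same first and last value

smooth_mean = mean_consecutive_diff(smooth)
jagged_mean = mean_consecutive_diff(jagged)

# Both averages reduce to (last - first) / (number of differences).
endpoint_only = (smooth[-1] - smooth[0]) / (len(smooth) - 1)
print(smooth_mean, jagged_mean, endpoint_only)
```

Two sequences with very different behavior between their endpoints produce the same averaged value, which is the limitation the alternatives above are meant to address.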