SequenceLengthSimilarity

This metric is for sequential data. It measures how closely the sequence lengths in a synthetic column match the sequence lengths in the corresponding real column.

Data Compatibility

ID: This metric is meant for a column that represents sequence IDs. The IDs are used to distinguish between different sequences.

This metric ignores missing values.

Score

(best) 1.0: The sequence lengths in the synthetic data are exactly the same as the real data

(worst) 0.0: The sequence lengths in the synthetic data are as different as can be from the real data

How does it work?

This metric assumes you have an ID column to represent sequences. For example, if you are storing different sequences of patient health information, the Patient ID column represents the sequence ID. The length of a sequence is the number of rows that share the same ID value.
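To make the idea of a sequence length concrete, here is a minimal sketch that counts the rows per ID with pandas. The patient IDs below are made up for illustration, and this is not necessarily how the library computes lengths internally.

```python
import pandas as pd

# Hypothetical ID column: three rows for P1, two for P2, one for P3,
# plus one missing value that should be ignored.
patient_ids = pd.Series(['P1', 'P1', 'P1', 'P2', 'P2', None, 'P3'], name='Patient ID')

# value_counts drops missing values by default, so each count is the
# number of rows in that sequence.
sequence_lengths = patient_ids.value_counts(dropna=True).sort_index()

print(sequence_lengths.to_dict())  # {'P1': 3, 'P2': 2, 'P3': 1}
```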

This metric first computes the length of each sequence in the real data. Since you may have multiple sequences, these lengths form a distribution of real sequence lengths, D_r. The metric then computes the same for the synthetic data, forming a second distribution of sequence lengths, D_s.

This metric then compares the two distributions using the KSComplement metric, which is 1 minus the Kolmogorov-Smirnov statistic between them.

score = KSComplement(D_r, D_s)
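The computation can be approximated outside the library as follows. This is a rough sketch, not the library's implementation: it assumes KSComplement is 1 minus the two-sample Kolmogorov-Smirnov statistic and uses scipy.stats.ks_2samp to obtain it. The function name sequence_length_similarity_sketch is hypothetical.

```python
import pandas as pd
from scipy.stats import ks_2samp


def sequence_length_similarity_sketch(real_ids: pd.Series, synthetic_ids: pd.Series) -> float:
    """Approximate the score as 1 - KS statistic of the sequence lengths."""
    # D_r and D_s: one length per sequence ID (missing IDs are dropped).
    real_lengths = real_ids.dropna().value_counts().to_numpy()
    synthetic_lengths = synthetic_ids.dropna().value_counts().to_numpy()

    # The KS statistic is the largest gap between the two empirical CDFs.
    ks_statistic = ks_2samp(real_lengths, synthetic_lengths).statistic
    return 1.0 - float(ks_statistic)


# Same length distribution (3 and 2) under different ID values.
real = pd.Series(['A', 'A', 'A', 'B', 'B'])
same_lengths = pd.Series(['x', 'x', 'x', 'y', 'y'])

# Completely different lengths: real sequences have 1 row, synthetic have 10.
short_real = pd.Series(['A', 'B'])
long_synthetic = pd.Series(['x'] * 10 + ['y'] * 10)

score_same = sequence_length_similarity_sketch(real, same_lengths)
score_different = sequence_length_similarity_sketch(short_real, long_synthetic)
```

In this sketch, score_same reaches the best value because the two length distributions are identical, while score_different reaches the worst value because the length distributions do not overlap at all.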

Usage

Access this metric from the single_column module and use the compute method.

from sdmetrics.single_column import SequenceLengthSimilarity

SequenceLengthSimilarity.compute(
    real_data=real_table['id_column'],
    synthetic_data=synthetic_table['id_column']
)

Parameters

(required) real_data: A pandas.Series object with the column of real data
(required) synthetic_data: A pandas.Series object with the column of synthetic data

FAQs

Do the ID values have to match up between the real and synthetic data?

No, the ID values are not expected to be the same between the real and synthetic data because they represent entirely different entities. This metric only compares the lengths of the sequences, not the ID values themselves.
