SequenceLengthSimilarity
This metric is for sequential data. It measures the similarity between a real and synthetic column in terms of the length of sequences that they represent.
ID: This metric is meant for a column that represents sequence IDs. The IDs are used to distinguish between different sequences.
This metric ignores missing values.
(best) 1.0: The sequence lengths in the synthetic data are exactly the same as the real data
(worst) 0.0: The sequence lengths in the synthetic data are as different as can be from the real data
This metric assumes you have an ID column to represent sequences. For example, if you are storing different sequences of patient health information, the Patient ID column represents the sequence ID. The length of a sequence is determined by how often an ID value repeats.
This metric first computes the length of each sequence in the real data. Since you may have multiple sequences, this will form a distribution of real data, D_r. The metric will then compute the same for the synthetic data, forming a different distribution, D_s.
This metric will then compare the two distributions using the KSComplement metric.
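The steps above can be sketched in plain Python. This is an illustrative approximation, not the library's implementation: the helper names `sequence_lengths` and `ks_complement` are hypothetical, missing IDs are represented as `None`, and the comparison is assumed to be 1 minus the two-sample Kolmogorov-Smirnov statistic, which is how KSComplement is understood to score two distributions.

```python
from bisect import bisect_right
from collections import Counter


def sequence_lengths(ids):
    """Length of each sequence: how often each non-missing ID repeats."""
    counts = Counter(i for i in ids if i is not None)  # missing IDs are ignored
    return sorted(counts.values())


def ks_complement(real_lengths, synthetic_lengths):
    """1 minus the largest gap between the two empirical CDFs (assumed KSComplement)."""
    real_sorted = sorted(real_lengths)
    synth_sorted = sorted(synthetic_lengths)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


# Hypothetical ID columns: each repeated value is one sequence.
real_ids = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
synthetic_ids = ["X", "X", "Y", "Y", "Z", "Z", "Z", "Z", None]

real_lengths = sequence_lengths(real_ids)            # D_r
synthetic_lengths = sequence_lengths(synthetic_ids)  # D_s
score = ks_complement(real_lengths, synthetic_lengths)
print(score)
```

Identical length distributions give a score of 1.0, and distributions with no overlap in their ranges give 0.0, matching the best and worst scores described above.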
Access this metric from the single_column module and use the compute method.
Parameters
(required) real_data: A pandas.Series object with the column of real data
(required) synthetic_data: A pandas.Series object with the column of synthetic data