MissingValueSimilarity
This metric compares whether the synthetic data has the same proportion of missing values as the real data for a given column.
Data Compatibility
All data: Any data is compatible with this metric as long as it contains missing values
Score
(best) 1.0: The synthetic data perfectly captures the proportion of missing values
(worst) 0.0: The synthetic data has a completely different proportion of missing values than the real data
How does it work?
This test computes the proportion of missing values, p, in both the real and synthetic data, R and S. It takes the absolute difference between the two proportions and subtracts it from 1, which yields a similarity score in the range [0, 1], with 1 representing the highest similarity.
$$score = 1 - |p_R - p_S|$$
Note that the term on the right, |p_R - p_S|, is equivalent to the Total Variation Distance [1] between the missing/non-missing values of the real and synthetic data.
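The computation described above can be sketched with plain pandas. This is a minimal illustration rather than the library's own implementation: the helper name `missing_value_similarity` is hypothetical, and the sketch assumes the score is one minus the absolute difference between the two missing-value proportions (the Total Variation Distance over the missing/non-missing categories).

```python
import pandas as pd


def missing_value_similarity(real: pd.Series, synthetic: pd.Series) -> float:
    """Hypothetical helper: 1 minus the gap between missing-value proportions."""
    # isna() flags NaN/None entries; the mean of the boolean mask is the proportion.
    p_real = real.isna().mean()
    p_synthetic = synthetic.isna().mean()
    return float(1.0 - abs(p_real - p_synthetic))


real = pd.Series([1.0, None, 3.0, None])      # 2 of 4 values are missing
synthetic = pd.Series([1.0, 2.0, None, 4.0])  # 1 of 4 values is missing
sketch_score = missing_value_similarity(real, synthetic)
print(sketch_score)
```

Because only the proportions are compared, the positions of the missing values within the column do not affect the score.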
Usage
Access this metric from the single_column module and use the compute method.
Parameters
(required) real_data: A pandas.Series containing a single column with missing values
(required) synthetic_data: A pandas.Series object with the synthetic version of the column
References