This metric compares whether the synthetic data has the same proportion of missing values as the real data for a given column.
- All data: Any type of data is compatible with this metric, as long as it contains missing values.
(best) 1.0: The synthetic data perfectly captures the proportion of missing values
(worst) 0.0: The synthetic data has a completely different proportion of missing values than the real data
This test computes the proportion of missing values, p, in both the real and synthetic data, R and S. It normalizes them and returns a similarity score in the range [0, 1], with 1 representing the highest similarity.
score = 1 - |p_R - p_S|
Note that the term at the right, |p_R - p_S|, is equivalent to the Total Variation Distance of the missing/non-missing values between the real and synthetic data.
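The computation can be sketched by hand with pandas. This is a minimal illustration, not the library's own code: it assumes the score is one minus the absolute difference between the two missing-value proportions, and the helper name `missing_value_similarity` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_value_similarity(real: pd.Series, synthetic: pd.Series) -> float:
    """Hypothetical helper: 1 minus the gap between missing-value proportions."""
    # isna() marks missing entries; the mean of the boolean mask is the proportion p.
    p_real = real.isna().mean()
    p_synthetic = synthetic.isna().mean()
    return float(1.0 - abs(p_real - p_synthetic))


# Made-up sample columns, for illustration only.
real = pd.Series([1.0, np.nan, 3.0, np.nan])      # 2 of 4 values missing
synthetic = pd.Series([1.0, 2.0, 3.0, np.nan])    # 1 of 4 values missing

score = missing_value_similarity(real, synthetic)
print(score)  # 0.75
```

Comparing a column with itself gives 1.0, since the two proportions are identical.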
Access this metric from the single_column module and use the compute method.
from sdmetrics.single_column import MissingValueSimilarity
real_data: A pandas.Series containing a single column with missing values
synthetic_data: A pandas.Series object with the synthetic version of the column
We use the same convention as pandas for determining when a value is missing. Missing values in your data should be represented as NaN objects. If you are using any special notation to denote missing values, convert them to NaN values before using this metric.
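One way to do this conversion is with pandas `replace`. The sketch below assumes hypothetical placeholder values (`-999` for numbers, `"N/A"` and the empty string for text); substitute whatever notation your own data uses.

```python
import numpy as np
import pandas as pd

# Numeric column where -999 is a hypothetical placeholder for "missing".
raw_numeric = pd.Series([12.5, -999, 7.0, -999, 3.2])
cleaned_numeric = raw_numeric.replace(-999, np.nan)

# Text column where "N/A" and "" are hypothetical placeholders for "missing".
raw_text = pd.Series(["a", "N/A", "b", ""])
cleaned_text = raw_text.replace(["N/A", ""], np.nan)

# After conversion, pandas recognizes these entries as missing.
print(cleaned_numeric.isna().sum())  # 2
print(cleaned_text.isna().sum())     # 2
```

Apply the same conversion to both the real and the synthetic column so that their missing-value proportions are counted consistently.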