MissingValueSimilarity

This metric compares whether the synthetic data has the same proportion of missing values as the real data for a given column.

Data Compatibility

All data: Any data is compatible with this metric as long as it contains missing values

Score

(best) 1.0: The synthetic data perfectly captures the proportion of missing values

(worst) 0.0: The synthetic data has a completely different proportion of missing values than the real data (for example, one column is entirely missing while the other has no missing values)

How does it work?

This metric computes the proportion of missing values in the real column, R_p, and in the synthetic column, S_p. It then converts the absolute difference between the two proportions into a similarity score in the range [0, 1], with 1 representing the highest similarity.

score = 1 - |S_p - R_p|

Note that the subtracted term, |S_p - R_p|, is equivalent to the Total Variation Distance [1] between the real and synthetic data when each value is treated as belonging to one of two categories: missing or non-missing.
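The formula and its relationship to the Total Variation Distance can be checked by hand. The following is a minimal sketch that assumes only pandas. The helper names missing_value_similarity and total_variation_distance are hypothetical and written for illustration; they are not part of SDMetrics and do not claim to mirror its internal implementation.

```python
import pandas as pd


def missing_value_similarity(real: pd.Series, synthetic: pd.Series) -> float:
    """Hand-rolled sketch of score = 1 - |S_p - R_p| (illustrative, not library code)."""
    r_p = real.isna().mean()       # proportion of missing values in the real column
    s_p = synthetic.isna().mean()  # proportion of missing values in the synthetic column
    return 1.0 - abs(s_p - r_p)


def total_variation_distance(real: pd.Series, synthetic: pd.Series) -> float:
    """TVD over the two categories: missing and non-missing."""
    r_p = real.isna().mean()
    s_p = synthetic.isna().mean()
    # Half the sum of absolute differences across both categories.
    return 0.5 * (abs(s_p - r_p) + abs((1 - s_p) - (1 - r_p)))


# Toy columns made up for this example.
real = pd.Series([1.0, None, 3.0, None])      # 2 of 4 missing, so R_p = 0.5
synthetic = pd.Series([1.0, 2.0, None, 4.0])  # 1 of 4 missing, so S_p = 0.25

score = missing_value_similarity(real, synthetic)
tvd = total_variation_distance(real, synthetic)

print(score)  # 0.75
print(tvd)    # 0.25, which equals 1 - score
```

Because there are only two categories, the two absolute differences in the distance are identical, which is why it collapses to |S_p - R_p|.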

Usage

Access this metric from the single_column module and use the compute method.

from sdmetrics.single_column import MissingValueSimilarity

MissingValueSimilarity.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

(required) real_data: A pandas.Series containing a single column with missing values
(required) synthetic_data: A pandas.Series object with the synthetic version of the column

FAQs

What kind of values count as missing?

We use the same convention as pandas for determining when a value is missing [2]. Missing values in your data should be represented as NaN objects.

If you are using any special notation to denote missing values, convert them to NaN values before using this metric.
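One possible way to do this conversion with pandas is sketched below. The placeholder strings ('N/A', '' and 'unknown') and the column contents are assumptions made up for this example; substitute whatever notation your own data uses.

```python
import pandas as pd

# Toy column in which missing entries are encoded with custom placeholders.
raw = pd.Series(['red', 'N/A', 'blue', '', 'unknown', 'green'])

# Hypothetical placeholder values that should be treated as missing.
placeholders = ['N/A', '', 'unknown']

# Series.mask replaces every entry where the condition is True with NaN.
cleaned = raw.mask(raw.isin(placeholders))

n_missing = int(cleaned.isna().sum())
print(n_missing)  # 3
```

The cleaned column can then be passed to the metric as real_data or synthetic_data. If the data is loaded from a CSV file, the na_values argument of pandas.read_csv is another way to achieve the same result at load time.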

References

[1] https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures

[2] https://pandas.pydata.org/docs/user_guide/missing_data.html

