# MissingValueSimilarity

This metric compares whether the synthetic data has the same proportion of missing values as the real data for a given column.

**All data**: Any data is compatible with this metric as long as it contains missing values

**(best) 1.0**: The synthetic data perfectly captures the proportion of missing values

**(worst) 0.0**: The synthetic data has a completely different proportion of missing values than the real data

This test computes the proportion of missing values, *p*, in both the real and synthetic data, *R* and *S*. It normalizes them and returns a similarity score in the range [0, 1], with 1 representing the highest similarity.

$$score = 1 - |S_p - R_p|$$
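As a rough illustration of the formula, the sketch below recomputes the score directly with pandas. It is not the library's implementation: the function name `missing_value_similarity_sketch` and the sample data are made up for this example, and it assumes missing values are already stored as `NaN`.

```python
import pandas as pd


def missing_value_similarity_sketch(real: pd.Series, synthetic: pd.Series) -> float:
    """Hypothetical helper: compute 1 - |S_p - R_p| for one column."""
    real_p = real.isna().mean()            # R_p: proportion of missing values in the real column
    synthetic_p = synthetic.isna().mean()  # S_p: proportion of missing values in the synthetic column
    return float(1 - abs(synthetic_p - real_p))


# Made-up sample data for illustration only
real_column = pd.Series([1.0, None, 3.0, None])      # 2 of 4 missing, R_p = 0.5
synthetic_column = pd.Series([1.0, 2.0, None, 4.0])  # 1 of 4 missing, S_p = 0.25

score = missing_value_similarity_sketch(real_column, synthetic_column)
print(score)  # 0.75
```

Two columns with the same proportion of missing values score 1.0 under this sketch, regardless of which rows are missing.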

Note that the term $|S_p - R_p|$ is equivalent to the Total Variation Distance [1] between the missing/non-missing distributions of the real and synthetic data.
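The equivalence can be checked with a short derivation, assuming the standard definition of the Total Variation Distance as half the sum of absolute differences between two discrete distributions. Treat each column as a distribution over two categories, missing and non-missing, with probabilities $(R_p, 1 - R_p)$ for the real data and $(S_p, 1 - S_p)$ for the synthetic data:

$$TVD = \frac{1}{2}\Big(|S_p - R_p| + |(1 - S_p) - (1 - R_p)|\Big) = \frac{1}{2}\Big(2\,|S_p - R_p|\Big) = |S_p - R_p|$$

Under that assumption, the score is simply $1 - TVD$.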

Access this metric from the `single_column` module and use the `compute` method.

```python
from sdmetrics.single_column import MissingValueSimilarity

# real_table and synthetic_table are assumed to be pandas DataFrames
MissingValueSimilarity.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)
```

**Parameters**

- (required) `real_data`: A pandas.Series containing a single column with missing values
- (required) `synthetic_data`: A pandas.Series object with the synthetic version of the column

We use the same convention as pandas for determining when a value is missing [2]. Missing values in your data should be represented as `NaN` objects. If you are using any special notation to denote missing values, convert them to `NaN` values before using this metric.
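A minimal sketch of that conversion step, assuming a column where the sentinel value `-999` stands for "missing". The sentinel, the variable names, and the data are hypothetical; substitute whatever notation your dataset actually uses.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which -999 marks a missing value
raw_column = pd.Series([25, -999, 31, -999, 40])

# Replace the sentinel with NaN so that pandas recognizes the values as missing
clean_column = raw_column.replace(-999, np.nan)

print(raw_column.isna().sum())    # 0: the sentinel is not treated as missing
print(clean_column.isna().sum())  # 2
```

Without this step, the sentinel values count as ordinary data and the proportion of missing values is underestimated.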