# MissingValueSimilarity

This metric compares whether the synthetic data has the same proportion of missing values as the real data for a given column.

## Data Compatibility

* **All data**: Any data is compatible with this metric as long as it contains missing values

## Score

**(best) 1.0**: The synthetic data perfectly captures the proportion of missing values

**(worst) 0.0**: The synthetic data has a completely different proportion of missing values than the real data

## How does it work?

This test computes the proportion of missing values, *p*, in both the real and synthetic data, *R* and *S*. It then compares the two proportions and returns a similarity score in the range \[0, 1], with 1 representing the highest similarity.

$$
score = 1 - |S_p - R_p|
$$

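Note that the subtracted term, the absolute difference between the two proportions, is equivalent to the Total Variation Distance \[1] between the missing/non-missing distributions of the real and synthetic data. With only two outcomes (missing or not missing), the two absolute differences in the Total Variation Distance are equal, so half of their sum is exactly the difference in the proportion of missing values.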
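
As an illustration, the formula can be sketched directly in pandas. This is a minimal sketch and not the library's implementation: the function name `missing_value_similarity` and the sample columns are hypothetical, and it assumes missing values are detected with `Series.isna()`.

```python
import numpy as np
import pandas as pd


def missing_value_similarity(real: pd.Series, synthetic: pd.Series) -> float:
    """Hypothetical re-implementation of the score formula, for illustration only."""
    # isna() flags missing entries; the mean of the boolean mask is the proportion p
    real_p = real.isna().mean()
    synthetic_p = synthetic.isna().mean()
    return float(1 - abs(synthetic_p - real_p))


# Made-up sample data: 2 of 4 real values and 1 of 4 synthetic values are missing
real_column = pd.Series([1.0, np.nan, 3.0, np.nan])
synthetic_column = pd.Series([1.0, 2.0, np.nan, 4.0])

score = missing_value_similarity(real_column, synthetic_column)
print(score)  # 1 - |0.25 - 0.5| = 0.75
```

Here the real proportion is 0.5 and the synthetic proportion is 0.25, so the score is 0.75.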

## Usage

Access this metric from the `single_column` module and use the `compute` method.

```python
from sdmetrics.single_column import MissingValueSimilarity

MissingValueSimilarity.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)
```

**Parameters**

* (required) `real_data`: A `pandas.Series` containing the real column, which may include missing values
* (required) `synthetic_data`: A `pandas.Series` containing the synthetic version of the same column

## FAQs

<details>

<summary>What kind of values count as missing?</summary>

We use the same convention as pandas for determining when a value is missing \[2]. Missing values in your data should be represented as `NaN` values.

If you are using any special notation to denote missing values, convert them to `NaN` values before using this metric.
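
One possible way to do this conversion is sketched below. The placeholder strings `"N/A"` and `"unknown"` and the sample column are assumptions made up for illustration; substitute whatever notation your data actually uses.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which missing entries were encoded as placeholder strings
raw_column = pd.Series(["12", "N/A", "7", "unknown", "3"])

# Replace the assumed placeholder values with NaN so that pandas treats them as missing
cleaned_column = raw_column.replace(["N/A", "unknown"], np.nan)

print(raw_column.isna().sum())      # placeholders are not counted as missing
print(cleaned_column.isna().sum())  # placeholders are now counted as missing
```

The cleaned column can then be passed to the metric as `real_data` or `synthetic_data`.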

</details>

## References

\[1] <https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures>

\[2] <https://pandas.pydata.org/docs/user_guide/missing_data.html>
