# ContingencySimilarity

This metric computes the similarity of a pair of categorical columns between the real and synthetic datasets; in other words, it compares their 2D distributions.

**Categorical**: This metric is meant for discrete, categorical data

**Boolean**: This metric works well on boolean data

To use this metric, both of the columns must be compatible. If there are missing values in the columns, the metric will treat them as an additional, single category.

**(best) 1.0**: The contingency table is exactly the same between the real vs. synthetic data

**(worst) 0.0**: The contingency table is as different as can be

The plots below show an example of fictitious real and synthetic data with ContingencySimilarity=0.92.

In this contingency table, a categorical column describing a country (vertical) is compared with a boolean column describing whether a user is subscribed (horizontal). The real and synthetic data have similar breakdowns for each `(country, subscribed)` combination, so the contingency similarity is high at 0.92. They are not exactly the same, so the score is <1.

For a pair of columns, *A* and *B*, the test computes a normalized contingency table [1] for the real and synthetic data. This table describes the proportion of rows that have each combination of categories in *A* and *B*. Then, it computes the difference between the contingency tables using the Total Variation Distance [2]. Finally, we subtract the distance from 1 so that a high score means high similarity. The process is summarized by the formula below.

$score = 1 - \frac{1}{2}\sum_{\alpha \in A}\sum_{\beta \in B} |S_{\alpha, \beta} - R_{\alpha, \beta}|$

In the formula, α describes all the possible categories in column A and β describes all the possible categories in column B. Meanwhile, R and S refer to the real and synthetic frequencies for those categories.
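The computation above can be sketched directly with pandas. This is a minimal illustration of the formula's logic, not the library's own implementation; the helper name and the `'__missing__'` sentinel are assumptions for this example.

```python
import pandas as pd

def contingency_similarity(real, synthetic):
    """Sketch of the metric: 1 minus the Total Variation Distance
    between the normalized contingency tables of two 2-column DataFrames."""
    col_a, col_b = real.columns[:2]
    # Treat missing values as a single additional category, as the metric does
    real = real[[col_a, col_b]].astype(object).fillna('__missing__')
    synthetic = synthetic[[col_a, col_b]].astype(object).fillna('__missing__')
    # Normalized contingency tables: proportion of rows per (alpha, beta) pair
    r = pd.crosstab(real[col_a], real[col_b], normalize=True)
    s = pd.crosstab(synthetic[col_a], synthetic[col_b], normalize=True)
    # Align on the union of categories; absent combinations count as 0
    r, s = r.align(s, fill_value=0)
    # Total Variation Distance, flipped so 1.0 means identical tables
    return 1 - 0.5 * (r - s).abs().to_numpy().sum()
```

Identical inputs score 1.0, and two datasets with no overlapping category combinations score 0.0, matching the best and worst cases described above.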

**Recommended Usage:** The Quality Report applies this metric to every pair of compatible columns and provides visualizations to understand the score.

To manually run this metric, access the `column_pairs` module and use the `compute` method.

```python
from sdmetrics.column_pairs import ContingencySimilarity

ContingencySimilarity.compute(
    real_data=real_table[['column_1', 'column_2']],
    synthetic_data=synthetic_table[['column_1', 'column_2']]
)
```

**Parameters**

- (required) `real_data`: A pandas.DataFrame object containing 2 columns of real data
- (required) `synthetic_data`: A pandas.DataFrame object containing 2 columns of synthetic data

If you want to compute the similarity between two numerical columns, use the CorrelationSimilarity metric.

However, if you want to compute a similarity between 1 numerical and 1 categorical column, there is no standard procedure. One option is to discretize the numerical column by breaking it up into multiple histogram bins. Then you can treat this column as categorical and use it with this metric. Note that this approach will no longer factor in the order of the numerical values.
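The discretization idea above can be sketched with `pandas.cut`. The column names, values, and bin edges here are hypothetical; in practice you would pick bins that suit your data and apply the same edges to both datasets.

```python
import pandas as pd

# Hypothetical numerical column 'age' paired with a categorical 'country'
real = pd.DataFrame({'age': [21, 35, 48, 62, 30, 55],
                     'country': ['US', 'US', 'MX', 'MX', 'US', 'MX']})
synthetic = pd.DataFrame({'age': [25, 33, 50, 60, 28, 58],
                          'country': ['US', 'US', 'MX', 'MX', 'MX', 'MX']})

# Choose bin edges once and apply them to BOTH datasets so the
# resulting categories line up
bins = [0, 30, 50, 100]
real = real.assign(age_bin=pd.cut(real['age'], bins=bins))
synthetic = synthetic.assign(age_bin=pd.cut(synthetic['age'], bins=bins))

# The binned column can now be treated as categorical, e.g.:
# ContingencySimilarity.compute(
#     real_data=real[['age_bin', 'country']],
#     synthetic_data=synthetic[['age_bin', 'country']],
# )
```

Note that the bins are ordered intervals, but the metric will ignore that order once the column is treated as categorical.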

Currently, the SDMetrics library does not support any explicit similarity metrics for higher order trends.

You may be interested in browsing through the experimental ML efficacy or detection metrics, which can factor in 3 or more columns when computing their score.

**Technical Notes: Association**

It is possible to use an association measure to quantify whether a contingency table is biased towards certain categories. There are many ways to compute association, a common coefficient being Cramer's V [3].

The association score measures the degree of bias but not its direction. This can lead to misleading results when comparing real and synthetic data. In the example below, both the real and synthetic data have the same, high association score even though they are biased towards different categories. If we simply compared association scores, we would erroneously conclude that the similarity is high, at 1.0.

The association is high for both the real and synthetic data because there is a significant bias in the categories. In the real data, subscribed users are biased to be in the US, but in the synthetic data, they are biased to be in Mexico.
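To make this concrete, here is a small sketch with hypothetical counts: two contingency tables that share the same Cramér's V yet are biased toward different countries, so their contingency similarity is low. The table values and helper function are illustrative assumptions, not part of the library.

```python
import numpy as np

def cramers_v(table):
    """Cramér's V association coefficient for a 2D contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence of the two columns
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Rows: US, MX; columns: subscribed, not subscribed (hypothetical counts)
real = np.array([[40, 10], [10, 40]])       # subscribers biased toward the US
synthetic = np.array([[10, 40], [40, 10]])  # bias flipped toward Mexico

# Both tables have identical association strength (Cramér's V = 0.6)...
same_association = abs(cramers_v(real) - cramers_v(synthetic)) < 1e-9

# ...but their distributions differ, so contingency similarity is low
tvd = 0.5 * np.abs(real / real.sum() - synthetic / synthetic.sum()).sum()
score = 1 - tvd  # 0.4 for these counts
```

Comparing association scores alone would report a perfect match here, while the contingency similarity correctly reveals the disagreement.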

This is why SDMetrics does *not* use association metrics to compare real and synthetic data.
