# ContingencySimilarity

This metric computes the similarity of a pair of categorical columns between the real and synthetic datasets; in other words, it compares their 2D distributions.

## Data Compatibility

* **Categorical**: This metric is meant for discrete, categorical data
* **Boolean**: This metric works well on boolean data
* **Numerical**: This metric discretizes numerical data into bins
* **Datetime**: This metric discretizes continuous datetime values into bins

To use this metric, both of the columns must be compatible. If there are missing values in the columns, the metric will treat them as an additional, single category.

## Score

**(best) 1.0**: The contingency table is exactly the same between the real vs. synthetic data

**(worst) 0.0**: The contingency table is as different as can be

The plots below show an example of fictitious real and synthetic data with ContingencySimilarity=0.92.

![In this contingency table, a categorical column describing a country (vertical) is compared with a boolean column describing whether a user is subscribed (horizontal). The real and synthetic data have similar breakdowns for each (country, subscribed) combination, so the contingency similarity is high at 0.92. They are not exactly the same, so the score is <1.](https://2284413265-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FrNLha4DaPNwVJ930KhmB%2Fuploads%2FTPBa95q90y0vkyvL1oao%2Fnewplot%20\(4\).png?alt=media\&token=db07a508-0ea0-46b0-b520-1d8e0a7899ea)

## How does it work?

For a pair of columns, *A* and *B*, the test computes a normalized contingency table \[1] for the real and synthetic data. This table describes the proportion of rows that have each combination of categories in *A* and *B*.

Then, it computes the difference between the contingency tables using the Total Variation Distance \[2]. Finally, it subtracts this distance from 1 so that a high score means high similarity. The process is summarized by the formula below.

$$
score = 1 - \frac{1}{2}\sum_{\alpha \in A}\sum_{\beta \in B} |S_{\alpha, \beta} - R_{\alpha, \beta}|
$$

In the formula, α ranges over all the possible categories in column *A* and β ranges over all the possible categories in column *B*. Meanwhile, *R* and *S* refer to the real and synthetic frequencies (proportions of rows) for each combination of categories.
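The library handles alignment of categories and any binning internally, but the core computation can be sketched in a few lines of pandas. This is a simplified illustration of the formula above, not the library's actual implementation:

```python
import pandas as pd

def contingency_similarity(real, synthetic, col_a, col_b):
    """Sketch: 1 minus the Total Variation Distance between the
    normalized contingency tables of two columns."""
    # Proportion of rows with each (alpha, beta) combination.
    r = real.groupby([col_a, col_b]).size() / len(real)
    s = synthetic.groupby([col_a, col_b]).size() / len(synthetic)
    # Align the tables so combinations missing from one side count as 0.
    r, s = r.align(s, fill_value=0)
    return 1 - 0.5 * (s - r).abs().sum()

real = pd.DataFrame({'a': ['x', 'x', 'y', 'y'], 'b': [0, 1, 0, 1]})
synth = pd.DataFrame({'a': ['x', 'x', 'x', 'x'], 'b': [0, 0, 0, 0]})
contingency_similarity(real, real, 'a', 'b')   # identical data → 1.0
contingency_similarity(real, synth, 'a', 'b')  # → 0.25
```

Identical data yields the best score of 1.0, while a synthetic table collapsed onto a single category combination scores much lower.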

## Usage

{% hint style="success" %}
**Recommended Usage:** The [Quality Report](https://docs.sdv.dev/sdmetrics/data-metrics/quality/quality-report/single-table-api) applies this metric to every pair of compatible columns and provides visualizations to understand the score.&#x20;
{% endhint %}

To manually run this metric, access the `column_pairs` module and use the `compute` method.

```python
from sdmetrics.column_pairs import ContingencySimilarity

ContingencySimilarity.compute(
    real_data=real_table[['column_1', 'column_2']],
    synthetic_data=synthetic_table[['column_1', 'column_2']]
)
```

**Parameters**

* (required) `real_data`: A pandas.DataFrame object containing 2 columns of real data
* (required) `synthetic_data`: A pandas.DataFrame object containing 2 columns of synthetic data
* `num_rows_subsample`: The number of rows to subsample before running this metric. Use this option to get an estimate of the full ContingencySimilarity score with faster performance.
  * (default) `None`: Do not subsample the rows. Use the full dataset to compute the score.
  * `<integer>`: Randomly subsample the provided number of rows for both the real and the synthetic datasets before computing the metric
* `continuous_column_names`: A list of column names that represent continuous values. Such columns will be discretized into bins before applying this metric.
  * (default) `None`: None of the columns are continuous
  * `[<column names>]`: Each column name in the list will be discretized into bins before applying this metric.
* `num_discrete_bins`: The number of discrete bins to create for continuous columns
  * (default) `10`: Discretize continuous columns into 10 bins
  * `<int>`: Discretize continuous columns into the number of bins provided
* `real_association_threshold`: The minimum association strength that the real data must exhibit as a prerequisite. If the real data's association is not strong enough, the metric returns NaN, indicating that a contingency similarity can't be computed.
  * (default) `None`: Do not apply a threshold. Supply a score regardless of any trends in the real data.
  * `<float>`: A value between 0 and 1. The association is considered strong if its Cramer's V coefficient \[3] is greater than the provided value. The metric only returns a score if the real data's association is this strong.

## FAQs

<details>

<summary>Is there an equivalent to this metric for numerical columns?</summary>

If you want to compute a similarity between 1 numerical and 1 categorical column, provide the numerical column's name in the `continuous_column_names` parameter. This discretizes the values into histogram bins, and then treats the column as categorical. Note that this approach will no longer factor in the order of the numerical values.
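As a rough illustration of what this discretization does, a continuous column can be cut into equal-width bins and the bin index treated as a category. The sketch below uses pandas' `pd.cut` with hypothetical data; the library's internal binning may differ:

```python
import pandas as pd

# Hypothetical continuous column of values.
prices = pd.Series([1.2, 3.4, 5.6, 7.8, 9.9, 2.1, 4.3, 8.7])

# Discretize into 10 equal-width bins; each row gets a bin index (0-9)
# that can then be used as a category in the contingency table.
binned = pd.cut(prices, bins=10, labels=False)
```

Note that two very different values can land in the same bin, which is why the ordering information is lost.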

If you want to compute the similarity between two numerical columns, use the [CorrelationSimilarity](https://docs.sdv.dev/sdmetrics/data-metrics/quality/correlationsimilarity) metric.&#x20;

</details>

<details>

<summary>Can you compare trends between 3 or more columns?</summary>

Currently, the SDMetrics library does not support any explicit similarity metrics for higher order trends.

You may be interested in browsing through the experimental [ML efficacy](https://docs.sdv.dev/sdmetrics/data-metrics/metrics-in-beta/ml-efficacy-single-table) or [detection](https://docs.sdv.dev/sdmetrics/data-metrics/metrics-in-beta/detection-single-table) metrics, which can factor in 3 or more columns when computing their score.

</details>

<details>

<summary>What threshold should I apply?</summary>

It's useful to evaluate whether the synthetic data matches the trends in the real data. But if there are no trends present in the real data, then it depends on what you want to measure.

* If you do not apply a threshold, then the metric considers it a success if the real and synthetic data both have no significant trends
* If you apply a threshold, then the metric effectively filters out any pairs of columns that don't have strong trends in the real data. *We recommend setting this to 0.3 or above to filter to strong associations.*

</details>

<details>

<summary>How can I speed up this metric?</summary>

This metric computes a full contingency table for all combinations of values in the 2 columns, for the real and synthetic data. This process may take some time if your datasets are large and if there are a large number of possible values.

To speed up the metric, we recommend using the `num_rows_subsample` parameter that subsamples the real and synthetic data before computing the contingency table. Ultimately, this score will be an estimate of the true score (based on the whole data), but in practice, we do not see significant changes if you make sure to keep a couple thousand rows.

If you have continuous columns, another way to speed up the metric is to decrease the number of discrete bins, which limits the size of the contingency table. Note that this may increase your score, because larger ranges of numbers are counted as a single bin.

</details>

**Technical Notes: Association**

It is possible to use an association measure to quantify whether a contingency table is biased towards certain categories. There are many ways to compute association, a common coefficient being Cramer's V \[3].

The association score measures the degree of bias but not its direction. This tells us whether there is a significant trend in the real data alone, but we shouldn't directly compare this number with the synthetic data. In the example below, both the real and synthetic data have the same, high association score even though they are biased towards different categories. If we simply compared association scores, we would erroneously conclude that the similarity is high, at 1.0.

![The association of the real and synthetic data are both high because there is a significant bias in the categories. In the real data, subscribed users are biased to be in the US but in the synthetic data, they are biased to be in Mexico.](https://2284413265-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FrNLha4DaPNwVJ930KhmB%2Fuploads%2FP185slHNJz6HHMuxCuYP%2FContingencySimilarity_association.png?alt=media\&token=f56b5126-d8db-4037-bda4-dc94be2c324a)

This metric uses the association to determine whether there is a strong trend in the real data (based on the provided threshold). But to perform its calculation, it directly compares the values in the contingency tables.
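For reference, Cramer's V can be computed from a raw-count contingency table with plain NumPy. This is a sketch of the standard formula (chi-squared statistic normalized by the table size), not necessarily how the library computes it:

```python
import numpy as np

def cramers_v(table):
    """Sketch of Cramer's V for a raw-count contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / n  # expected counts under independence
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape[0] - 1, table.shape[1] - 1)
    return np.sqrt(chi2 / (n * k))

cramers_v([[10, 0], [0, 10]])  # perfectly associated → 1.0
cramers_v([[5, 5], [5, 5]])    # independent → 0.0
```

Both tables in the example above would score 1.0 if one of them had the bias flipped to the other diagonal, which is exactly why association scores alone cannot substitute for comparing the contingency tables directly.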

## References

\[1] <https://en.wikipedia.org/wiki/Contingency_table>

\[2] <https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures>

\[3] [https://en.wikipedia.org/wiki/Cram%C3%A9r%27s\_V](https://en.wikipedia.org/wiki/Cram%C3%A9r's_V)
