# StatisticSimilarity

This metric measures the similarity between a real column and a synthetic column by comparing a summary statistic. Supported summary statistics are: mean, median and standard deviation.

## Data Compatibility

* **Numerical** : This metric is meant for continuous, numerical data
* **Datetime** : This metric converts datetime values into numerical values

This metric ignores missing values.
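
As a rough illustration of the two points above, the sketch below drops missing entries and turns datetimes into numbers before applying the formula from the "How does it work?" section. It uses only the Python standard library. The helper name `to_numeric` and the choice of epoch seconds are assumptions made for this example; this page does not state which numerical representation the library uses internally.

```python
import statistics
from datetime import datetime, timezone


def to_numeric(values):
    """Drop missing entries (None) and convert datetimes to epoch seconds.

    Hypothetical helper: epoch seconds is an assumed representation.
    """
    return [v.timestamp() for v in values if v is not None]


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


real_dates = [utc(2024, 1, 1), None, utc(2024, 1, 11), utc(2024, 1, 21)]
synthetic_dates = [utc(2024, 1, 3), utc(2024, 1, 13), None, utc(2024, 1, 23)]

real = to_numeric(real_dates)
synthetic = to_numeric(synthetic_dates)

# Mean similarity: the means differ by 2 days and the real data spans 20 days.
mean_gap = abs(statistics.mean(real) - statistics.mean(synthetic))
real_range = max(real) - min(real)
datetime_score = max(1 - mean_gap / real_range, 0.0)
```

Because the difference is divided by the range of the real data, the unit chosen for the conversion (seconds, days, and so on) cancels out of the score.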

## Score

**(best) 1.0**: The statistic for the real data is exactly the same as the statistic for the synthetic data

**(worst) 0.0**: The statistic for the real data is extremely different from the synthetic data

## How does it work?

This test computes the given statistical function, *f*, for the real and synthetic columns, *r* and *s*. Then, the test normalizes the absolute difference between the two results by the range of the real data and takes its complement. This creates a score that falls within the \[0, 1] range\*, where a high value means high similarity.

$$
score = 1 - \frac{| f(r) - f(s) |}{\max(r) - \min(r)}
$$

The supported statistical functions (*f*) are: the (arithmetic) mean, median and standard deviation.

*\*In rare cases, where the synthetic data statistic is very different from the real data, the computed score may be negative. In such cases we clip the score to 0, the worst possible score.*
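
The formula and the clipping rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the SDMetrics implementation: the function name `statistic_similarity_sketch` is hypothetical, missing values are assumed to be represented as `None`, and the sample standard deviation (`statistics.stdev`) is an assumption, since this page does not say which variant the library uses.

```python
import statistics

# Map the documented statistic names to standard-library functions.
# Using the sample standard deviation for 'std' is an assumption.
STATISTICS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "std": statistics.stdev,
}


def statistic_similarity_sketch(real, synthetic, statistic="mean"):
    """Return 1 - |f(r) - f(s)| / (max(r) - min(r)), clipped at 0."""
    f = STATISTICS[statistic]
    # Ignore missing values (assumed to be None in this sketch).
    r = [v for v in real if v is not None]
    s = [v for v in synthetic if v is not None]
    # A constant real column has zero range; this sketch does not handle it.
    real_range = max(r) - min(r)
    score = 1 - abs(f(r) - f(s)) / real_range
    return max(score, 0.0)


real = [0, 10, 20, 30, 40]
shifted = [5, 15, 25, 35, 45]         # every value shifted up by 5
far_away = [1000, 1000, 1000, 1000, 1000]

mean_score = statistic_similarity_sketch(real, shifted, "mean")
std_score = statistic_similarity_sketch(real, shifted, "std")
clipped_score = statistic_similarity_sketch(real, far_away, "mean")
```

In this example the means differ by 5 over a real range of 40, so `mean_score` is 0.875. The shift leaves the spread unchanged, so `std_score` is 1.0. The last case would produce a negative value and is clipped to 0.0.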

## Usage

Access this metric from the `single_column` module and use the `compute` method.

```python
from sdmetrics.single_column import StatisticSimilarity

StatisticSimilarity.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name'],
    statistic='mean'
)
```

**Parameters**

* (required) `real_data`: A pandas.Series object with the column of real data
* (required) `synthetic_data`: A pandas.Series object with the column of synthetic data
* `statistic`: A string describing the name of the statistical function
  * (default) `'mean'`: The arithmetic mean
  * `'median'`: The median value
  * `'std'`: The standard deviation

## FAQs

<details>

<summary>Is it better to use the <code>mean</code> or the <code>median</code> statistic?</summary>

The mean and median summarize the values differently, especially when your data has a skew \[1].

* The mean takes all values into account. It may be significantly affected by just a few large or small values. You can use the mean for a fast computation if you know there is no skew in your data or if you are ok with outliers affecting the score.
* The median finds a middle value where 50% of the data is larger and 50% is smaller. The median is resilient to outliers in either direction. This may be desirable if you have skewed data, as the score is more representative of the typical values. Note that this computation takes longer.
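
To make the difference concrete, the sketch below applies the score formula from this page to a made-up column where the synthetic data contains a single large value. It uses the standard-library `statistics` module rather than SDMetrics itself, and the helper name `score` is hypothetical.

```python
import statistics

real = [10, 20, 30, 40, 50]
synthetic = [10, 20, 30, 40, 100]  # same as real, except one large value

real_range = max(real) - min(real)


def score(f):
    """Apply the documented formula for a statistic f, clipped at 0."""
    return max(1 - abs(f(real) - f(synthetic)) / real_range, 0.0)


mean_score = score(statistics.mean)      # means are 30 and 40
median_score = score(statistics.median)  # both medians are 30
```

Here the single outlier pulls `mean_score` down to 0.75, while `median_score` stays at 1.0 because the middle value is unchanged.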

</details>

<details>

<summary>If I get a high score, does that mean my data looks exactly the same?</summary>

A high score indicates that the summary statistics of the real and synthetic data are close to each other. However, even if the scores for all statistics are exactly 1.0, you may still find some differences in the shapes of the synthetic vs. real data.

If you are interested in comparing the overall shapes, see the [KSComplement](/sdmetrics/data-metrics/quality/kscomplement.md) metric.
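
As a small, made-up illustration of this point, the two columns below share the same mean, median and (sample) standard deviation, so every statistic scores 1.0 under the formula on this page, yet the values are distributed differently. The sketch uses the standard-library `statistics` module rather than SDMetrics, and the helper name `score` is hypothetical.

```python
import statistics

real = [-5, -5, 5, 5]        # two clusters at -5 and 5
synthetic = [-7, -1, 1, 7]   # four distinct, spread-out values

real_range = max(real) - min(real)


def score(f):
    """Apply the documented formula for a statistic f, clipped at 0."""
    return max(1 - abs(f(real) - f(synthetic)) / real_range, 0.0)


scores = {
    "mean": score(statistics.mean),
    "median": score(statistics.median),
    "std": score(statistics.stdev),  # sample standard deviation (assumption)
}
same_values = sorted(real) == sorted(synthetic)
```

All three entries in `scores` are 1.0, while `same_values` is `False`.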

</details>

## References

\[1] <https://en.wikipedia.org/wiki/Skewness>

