CorrelationSimilarity

This metric measures the correlation between a pair of numerical columns and computes how similar that correlation is in the real and synthetic data. In other words, it compares the trends of 2D distributions. The metric supports both the Pearson and Spearman's rank coefficients for measuring correlation.

Data Compatibility

  • Numerical: This metric is meant for continuous, numerical data

  • Datetime: This metric converts datetime values into numerical values

This metric ignores missing values.
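As a rough illustration of the two compatibility notes above, the sketch below drops rows with missing values and converts a datetime column to integers before computing a correlation. This is not the library's internal code: the table, the column names (signup_date, spend), and the exact conversion are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical table with one datetime column, one numerical column,
# and a missing value in each.
table = pd.DataFrame({
    "signup_date": pd.to_datetime(
        ["2023-01-01", "2023-02-01", None, "2023-04-01", "2023-05-01"]
    ),
    "spend": [10.0, 12.5, 14.0, np.nan, 20.0],
})

# Assumption: "ignoring" missing values means dropping any row where
# either column of the pair is missing.
pair = table.dropna()

# Assumption: datetimes become integer timestamps. The unit does not
# matter, because correlation is unchanged by rescaling a column.
signup_numeric = pair["signup_date"].astype("int64").to_numpy()

corr, _ = pearsonr(signup_numeric, pair["spend"].to_numpy())
print(len(pair), corr)
```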

Score

(best) 1.0: The pairwise correlations of the real and synthetic data are exactly the same

(worst) 0.0: The pairwise correlations are as different as they can possibly be

Below is a graph that shows some fictitious data for 2 columns of real and synthetic data (black and blue, respectively). The Correlation Similarity Score is 0.64.

How does it work?

For a pair of columns, A and B, this test computes a correlation coefficient on the real data, R, and on the synthetic data, S. This yields two separate correlation values, each between -1 and +1, so their absolute difference lies between 0 and 2. The test normalizes this difference and returns a similarity score using the formula below.

$$score = 1 - \frac{|S_{A,B} - R_{A,B}|}{2}$$

Note that there are multiple ways to compute the correlation coefficient. This metric supports both the Pearson correlation coefficient [1][2] and Spearman's rank correlation coefficient [3][4]. Both are bounded between -1 and +1.
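The formula can be sketched directly with the SciPy functions from [2] and [4]. The helper name correlation_similarity and the generated data are hypothetical, and this is a minimal illustration of the formula rather than the library's implementation.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def correlation_similarity(real_a, real_b, synth_a, synth_b, coefficient="Pearson"):
    """Hypothetical helper: 1 - |S - R| / 2 for one pair of columns."""
    corr = pearsonr if coefficient == "Pearson" else spearmanr
    r, _ = corr(real_a, real_b)      # correlation in the real data
    s, _ = corr(synth_a, synth_b)    # correlation in the synthetic data
    return 1 - abs(s - r) / 2


rng = np.random.default_rng(0)

# Made-up real data with a positive trend between A and B.
real_a = rng.normal(size=500)
real_b = 2 * real_a + rng.normal(size=500)

# Made-up synthetic data with the opposite trend.
synth_a = rng.normal(size=500)
synth_b = -synth_a + rng.normal(size=500)

score = correlation_similarity(real_a, real_b, synth_a, synth_b)
identical = correlation_similarity(real_a, real_b, real_a, real_b)
print(score, identical)
```

Because the two trends point in opposite directions, the score lands well below 0.5, while comparing the real data against itself returns exactly 1.0.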

Pearson vs. Spearman Coefficients

The Pearson and Spearman rank correlation coefficients are commonly used in data science applications. The Pearson coefficient measures whether two columns are linearly correlated while the Spearman measures whether they are monotonically related.

Both coefficients range from -1 to +1. A rough interpretation is given in the table below.

Score | Pearson Coefficient | Spearman's Rank Coefficient
+1 | As one column increases, the other increases linearly | As one column increases, the other increases too
0 | As one column increases, the other column has no linear pattern | As one column increases, the other column has no pattern
-1 | As one column increases, the other decreases linearly | As one column increases, the other decreases too
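To make the distinction in the table concrete, the sketch below computes both coefficients on made-up data where one column is an exponential function of the other: the relationship is strictly increasing but far from linear.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative data: y always increases with x, but not along a line.
x = np.linspace(0, 10, 200)
y = np.exp(x)

pearson, _ = pearsonr(x, y)
spearman, _ = spearmanr(x, y)

# Spearman only looks at rank order, so a strictly increasing
# relationship is a perfect monotonic match. Pearson penalizes the
# departure from a straight line and comes out noticeably lower.
print(pearson, spearman)
```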

Usage

Recommended Usage: The Quality Report applies this metric to every pair of compatible columns and provides visualizations to understand the score.

To manually run this metric, access the column_pairs module and use the compute method.

from sdmetrics.column_pairs import CorrelationSimilarity

CorrelationSimilarity.compute(
    real_data=real_table[['column_1', 'column_2']],
    synthetic_data=synthetic_table[['column_1', 'column_2']],
    coefficient='Pearson'
)

Parameters

  • (required) real_data: A pandas.DataFrame object containing 2 columns of real data

  • (required) synthetic_data: A pandas.DataFrame object containing 2 columns of synthetic data

  • coefficient: A string that describes the correlation coefficient to use:

    • (default) 'Pearson' for the Pearson correlation coefficient [1]

    • 'Spearman' for the Spearman's rank correlation coefficient [3]

FAQs

When should I use the Pearson vs. the Spearman coefficient?

The difference between Pearson and Spearman is whether you assume the real data follows a linear trend (Pearson) or any monotonic trend, linear or not (Spearman). Choose the coefficient based on the kind of trend you expect in the real data and hope the synthetic data will capture.

Note that the Spearman coefficient may be slower to compute.

Is there an equivalent to this metric for categorical columns?

If you want to compute the similarity between two categorical columns, use the ContingencySimilarity metric.

However, if you want to compute a similarity between 1 numerical and 1 categorical column, there is no standard procedure. One option may be to discretize the numerical column by breaking it up into multiple histogram bins. Then you can treat this column as categorical and use it with ContingencySimilarity. Note that this approach no longer accounts for the order of the numerical values.
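The discretization idea can be sketched with pandas. The tables, the column names (age, plan), and the choice of 5 equal-width bins are all assumptions made for illustration. The bin edges are taken from the real data and reused for the synthetic data so that both tables share the same categories.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical tables with 1 numerical and 1 categorical column.
real_table = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "plan": rng.choice(["basic", "premium"], size=200),
})
synthetic_table = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "plan": rng.choice(["basic", "premium"], size=200),
})

# Derive 5 equal-width bins from the real data only.
_, bin_edges = pd.cut(real_table["age"], bins=5, retbins=True)

# Open the outer edges so synthetic values outside the real range
# still fall into the first or last bin.
bin_edges[0], bin_edges[-1] = -np.inf, np.inf

# labels=False replaces each value with its bin index (0 to 4).
real_binned = real_table.assign(
    age=pd.cut(real_table["age"], bins=bin_edges, labels=False)
)
synthetic_binned = synthetic_table.assign(
    age=pd.cut(synthetic_table["age"], bins=bin_edges, labels=False)
)

# The binned column can now be treated as categorical, for example:
# ContingencySimilarity.compute(
#     real_data=real_binned[['age', 'plan']],
#     synthetic_data=synthetic_binned[['age', 'plan']]
# )
```

The number of bins is a judgment call: too few bins hide detail in the numerical column, while too many leave each bin with very few rows.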

Technical Note: What is captured by this metric?

The correlation describes whether the data closely follows a trend or whether it's noisy. The CorrelationSimilarity score describes whether the correlations of the real and synthetic data are similar.

Be careful when interpreting this metric, as some scenarios are not easily apparent.

  • Correlation does not describe any details about the trend, such as the slope of a line or its overall shape. For example, the data below has a near perfect score of 0.98 because both the real and synthetic data have strongly positive linear correlations. However, the slopes of the lines are different.

  • The CorrelationSimilarity score can be high if your data is noisy. If neither the real nor the synthetic data has a clear trend, the correlation for both will be around 0. In this case, you will see a high CorrelationSimilarity score, indicating that the synthetic data is successfully capturing the non-existent "trend".
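Both scenarios can be reproduced with made-up data. The sketch below applies the score formula by hand using the Pearson coefficient; the slopes, noise levels, and sample size are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(0, 10, size=n)

# Scenario 1: both datasets are strongly linear, but the slopes differ (1 vs. 5).
real_y = 1.0 * x + rng.normal(scale=0.5, size=n)
synth_y = 5.0 * x + rng.normal(scale=0.5, size=n)
r, _ = pearsonr(x, real_y)
s, _ = pearsonr(x, synth_y)
slope_score = 1 - abs(s - r) / 2

# Scenario 2: neither dataset has a trend, so both correlations are near 0.
r_noise, _ = pearsonr(rng.normal(size=n), rng.normal(size=n))
s_noise, _ = pearsonr(rng.normal(size=n), rng.normal(size=n))
noise_score = 1 - abs(s_noise - r_noise) / 2

# Both scores come out close to 1, even though the slopes differ in the
# first scenario and there is no trend to capture in the second.
print(slope_score, noise_score)
```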

References

[1] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

[2] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

[3] https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

[4] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
