This metric measures the correlation between a pair of numerical columns and computes the similarity between the real and synthetic data. In other words, it compares the trends of 2D distributions. The metric supports both the Pearson and Spearman's rank coefficients to measure correlation.
- Numerical: This metric is meant for continuous, numerical data.
- Datetime: This metric converts datetime values into numerical values.
This metric ignores missing values.
(best) 1.0: The pairwise correlations of the real and synthetic data are exactly the same
(worst) 0.0: The pairwise correlations are as different as they can possibly be
Below is a graph that shows some fictitious data for 2 columns of real and synthetic data (black and blue, respectively). The Correlation Similarity Score is 0.64.
The real data has a strongly positive correlation of 0.93 but the synthetic data has a weak correlation of 0.22. The overall similarity score is 0.64, capturing the fact that the synthetic data has a noisier trend.
For a pair of columns, A and B, this test computes a correlation coefficient on the real and synthetic data, R and S. This yields two separate correlation values. The test normalizes and returns a similarity score using the formula below.
Note that there are multiple ways to compute the correlation coefficient. This metric supports both the Pearson correlation coefficient and Spearman's rank correlation coefficient. Both are bounded between -1 and +1.
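One normalization consistent with the bounds and the scores quoted above is score = 1 - |S - R| / 2, where R and S are the correlations of the real and synthetic data for columns A and B. The sketch below assumes that form. The helper name `correlation_similarity` is hypothetical and not part of the library.

```python
def correlation_similarity(real_corr: float, synthetic_corr: float) -> float:
    """Map two correlation coefficients in [-1, 1] to a similarity in [0, 1].

    Assumed normalization: 1 - |S - R| / 2.
    """
    for value in (real_corr, synthetic_corr):
        if not -1.0 <= value <= 1.0:
            raise ValueError("correlation coefficients must lie in [-1, 1]")
    return 1.0 - abs(synthetic_corr - real_corr) / 2.0


# Identical correlations give the best score.
best = correlation_similarity(0.5, 0.5)
# Correlations at opposite extremes give the worst score.
worst = correlation_similarity(1.0, -1.0)
# The fictitious example above: real 0.93, synthetic 0.22.
example = correlation_similarity(0.93, 0.22)
```

Under this assumption, the largest possible gap between two coefficients is 2, so dividing by 2 keeps the score between 0 and 1.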
The Pearson and Spearman rank correlation coefficients are commonly used in data science applications. The Pearson coefficient measures whether two columns are linearly correlated while the Spearman measures whether they are monotonically related.
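To make the distinction concrete, here is a minimal pure-Python sketch that computes both coefficients on data that is monotonic but not linear. The helpers `pearson`, `ranks`, and `spearman` are illustrative only. They assume no tied values and are not the implementation the metric uses.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [value ** 3 for value in x]  # monotonic but not linear

pearson_xy = pearson(x, y)    # below 1: the points do not lie on a straight line
spearman_xy = spearman(x, y)  # 1: the ordering of y matches the ordering of x
```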
Both coefficients range from -1 to +1. A rough interpretation is given in the table below.
To manually run this metric, access the column_pairs module and use the compute method.
from sdmetrics.column_pairs import CorrelationSimilarity
real_data: A pandas.DataFrame object containing 2 columns of real data
synthetic_data: A pandas.DataFrame object containing 2 columns of synthetic data
coefficient: A string that describes the correlation coefficient to use:
'Pearson' for the Pearson correlation coefficient
'Spearman' for the Spearman's rank correlation coefficient
The difference between Pearson and Spearman is in whether we assume the real data has a linear trend. Use a coefficient based on what you expect the real data to have and what you hope the synthetic data will be able to effectively capture.
Note that the Spearman coefficient may be slower to compute.
If you want to compute a similarity between 1 numerical and 1 categorical column, there is no standard procedure. One option is to discretize the numerical column by breaking it up into multiple histogram bins. You can then treat this column as categorical and use it with ContingencySimilarity. Note that this approach no longer factors in the order of the numerical values.
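The discretization step could be sketched with pandas as follows. The tables, the column names `age` and `plan`, the helper `discretize`, and the choice of 4 equal-width bins are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical tables with one numerical and one categorical column.
real_table = pd.DataFrame({
    'age': [23, 35, 41, 52, 60, 29],
    'plan': ['basic', 'basic', 'plus', 'plus', 'premium', 'basic'],
})
synthetic_table = pd.DataFrame({
    'age': [25, 33, 44, 50, 58, 31],
    'plan': ['basic', 'plus', 'plus', 'premium', 'premium', 'basic'],
})

# Share one set of bin edges so a bin label means the same range in both tables.
low = min(real_table['age'].min(), synthetic_table['age'].min())
high = max(real_table['age'].max(), synthetic_table['age'].max())
bin_edges = np.linspace(low, high, num=5)  # 4 equal-width bins (arbitrary choice)


def discretize(table):
    """Return a copy of the table with 'age' replaced by its bin label."""
    binned = table.copy()
    binned['age'] = pd.cut(
        binned['age'], bins=bin_edges, include_lowest=True
    ).astype(str)
    return binned


real_binned = discretize(real_table)
synthetic_binned = discretize(synthetic_table)
```

The two binned tables now contain only categorical values and can be passed to ContingencySimilarity in place of the originals. The bin labels are plain strings, which is why the ordering of the numerical values is lost.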
The correlation describes whether the data closely follows a trend or whether it's noisy. The CorrelationSimilarity score describes whether the correlations of the real and synthetic data are similar.
Be careful when interpreting this metric, as some of its behaviors are not obvious.
- Correlation does not describe any details about the trend, such as the slope of a line or its overall shape. For example, the data below has a near perfect score of 0.98 because both the real and synthetic data have strongly positive linear correlations. However, the slopes of the lines are different.
- The CorrelationSimilarity score can be high if your data is noisy. If neither the real nor the synthetic data has a clear trend, the correlation for both will be around 0. In this case, you will see a high CorrelationSimilarity score, indicating that the synthetic data is successfully capturing the non-existent "trend".
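Both caveats can be reproduced with a small pure-Python sketch. It assumes the score is normalized as 1 - |S - R| / 2 and uses the Pearson coefficient. The helpers `pearson`, `similarity`, and `noise` are hypothetical and the data is generated only for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def similarity(real_corr, synthetic_corr):
    """Assumed normalization: 1 - |S - R| / 2."""
    return 1.0 - abs(synthetic_corr - real_corr) / 2.0


# Caveat 1: same correlation, different slopes.
x = list(range(1, 21))
real_y = [2.0 * value for value in x]       # steep line
synthetic_y = [0.5 * value for value in x]  # shallow line
slope_score = similarity(pearson(x, real_y), pearson(x, synthetic_y))

# Caveat 2: no trend in either dataset.
rng = random.Random(0)  # fixed seed so the run is repeatable


def noise(n):
    """Return n independent uniform samples."""
    return [rng.random() for _ in range(n)]


noise_score = similarity(
    pearson(noise(2000), noise(2000)),  # "real" columns, unrelated to each other
    pearson(noise(2000), noise(2000)),  # "synthetic" columns, also unrelated
)
```

In the first case the score is essentially perfect even though the lines differ, and in the second it is high even though neither dataset contains a trend to capture.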