# CSTest

This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes. You can think of the shape as what you observe when you plot a bar graph of the column.

## Data Compatibility

* **Categorical**: This metric is meant for discrete, categorical data
* **Boolean**: This metric works on boolean data

{% hint style="warning" %}
This metric does not accept missing values.
{% endhint %}

## Score

**(best) 1.0**: The p-value is high, indicating that the synthetic data is not very different from the real data

**(worst) 0.0**: The p-value is low, indicating that the synthetic data is significantly different from the real data

## How does it work?

This test normalizes the real and synthetic data in order to compute the category frequencies. Then, it applies the Chi-squared test \[1] to test the null hypothesis that the synthetic data comes from the same distribution as the real data.

The test returns the p-value \[2]. A smaller p-value indicates that the synthetic data is significantly different from the real data, which rejects the null hypothesis and leads to a worse overall score.

## Usage

Access this metric from the `single_column` module and use the `compute` method.

```python
from sdmetrics.single_column import CSTest

CSTest.compute(
    real_data=real_column,
    synthetic_data=synthetic_column
)
```

**Parameters**

* (required) `real_data`: A `pandas.Series` containing a single column
* (required) `synthetic_data`: A similar `pandas.Series` object with the synthetic version of the column

## FAQs

{% hint style="info" %}
**This metric is in Beta.** Be careful when using the metric and interpreting its score.

* This test is invalid when the category frequencies are too small \[3]. Because this test normalizes all frequencies, they are all less than 1, which affects the overall results.
* The p-value may be hacked by supplying data of different sizes.
* The p-value interpretation may not be useful. Most users are interested in quantifying the differences between real and synthetic data, not testing the null hypothesis. In fact, if the synthetic data model is a simplification of the real data, the null hypothesis will nearly always be false with enough data.

Consider switching to the [TVComplement](/sdmetrics/data-metrics/quality/tvcomplement.md) metric to quantify the differences between discrete real and synthetic columns.
{% endhint %}

## References

\[1]

\[2]

\[3]
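
## Sketch of the computation

The procedure described under "How does it work?" (normalize both columns into category frequencies, then run a Chi-squared test and report the p-value) can be illustrated with a minimal, standard-library-only sketch. This is not the SDMetrics implementation: the helper names `chi2_survival` and `cstest_sketch` are hypothetical, and raising an error for categories that appear only in the synthetic column is an assumption made here to keep the expected frequencies non-zero. In practice a statistics library would normally supply the Chi-squared test.

```python
import math
from collections import Counter


def chi2_survival(stat, dof):
    """P(X > stat) for a chi-squared variable with `dof` degrees of freedom."""
    x = stat / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-x)              # Q(1, x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))   # Q(1/2, x)
    # Recurrence: Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    while a < dof / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def cstest_sketch(real, synthetic):
    """Illustrative p-value computed from normalized category frequencies."""
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    unseen = set(synth_counts) - set(real_counts)
    if unseen:
        # Assumption of this sketch: expected frequencies must be non-zero.
        raise ValueError(f"categories missing from the real data: {unseen}")
    if len(real_counts) == 1:
        return 1.0  # a single shared category leaves nothing to test
    stat = 0.0
    for category, count in real_counts.items():
        expected = count / len(real)                        # real frequency
        observed = synth_counts[category] / len(synthetic)  # synthetic frequency
        stat += (observed - expected) ** 2 / expected
    return chi2_survival(stat, dof=len(real_counts) - 1)


real_column = ["a"] * 50 + ["b"] * 50
synthetic_column = ["a"] * 90 + ["b"] * 10
p_value = cstest_sketch(real_column, synthetic_column)
p_value_10x = cstest_sketch(real_column * 10, synthetic_column * 10)
```

In this sketch, `p_value` and `p_value_10x` are equal: because the statistic is built from normalized frequencies rather than raw counts, it does not grow with the number of rows. That behavior is one way to see why the Beta caveats above recommend interpreting the score with care.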