CategoryAdherence

This metric measures whether a synthetic column adheres to the same category values as the real data. (The synthetic data should not be inventing new category values that are not originally present in the real data.)

Data Compatibility

  • Categorical: This metric is meant for discrete, categorical data

  • Boolean: This metric is meant for boolean data

If the real data contains missing values, then the metric will consider missing values in the synthetic data to be valid. Otherwise, they will be marked as an invalid category. All types of missing values (NaN, None, etc.) are counted as the same 'missing' category.
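As an illustration, below is a minimal sketch (not the library's implementation) of how all missing values could be collapsed into a single 'missing' category before the comparison; the helper name and placeholder label are hypothetical.

import pandas as pd

def normalize_missing(column):
    # Collapse every kind of missing value (NaN, None, pd.NA, ...) into a
    # single placeholder so they are all treated as one 'missing' category.
    return column.astype(object).where(column.notna(), 'missing')

real = pd.Series(['A', 'B', None, 'A'])
synthetic = pd.Series(['A', float('nan'), 'B'])

# Because the real column contains a missing value, the synthetic NaN maps to
# the same 'missing' category and counts as valid.
print(normalize_missing(real).unique())       # ['A' 'B' 'missing']
print(normalize_missing(synthetic).unique())  # ['A' 'missing' 'B']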

Score

  • (best) 1.0: All category values in the synthetic data were present in the real data

  • (worst) 0.0: None of the category values in the synthetic data were present in the real data

Any score in between tells us the proportion of data points that adhere to the correct values. For example, 0.6 means that 60% of synthetic data points have a value present in the real data, while the remaining 40% contain new values that were never present in the real data.

How does it work?

This metric extracts the set of unique categories that are present in the real column, C_r.

Then it finds the number of data points in the synthetic data, s, that are found in the set C_r. The score is the proportion of these data points compared to all the synthetic data points.

score = \frac{| s, s \in C_r |}{|s|}
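To make the calculation concrete, here is a minimal sketch (not the library's implementation) of the same formula in pandas; the function name is hypothetical, and the missing-value handling described above is omitted for brevity.

import pandas as pd

def category_adherence_score(real, synthetic):
    # C_r: the set of unique categories present in the real column.
    real_categories = set(real.unique())

    # Proportion of synthetic data points whose value appears in C_r.
    return synthetic.isin(real_categories).mean()

real = pd.Series(['A', 'B', 'C', 'A'])
synthetic = pd.Series(['A', 'C', 'D', 'B', 'E'])
print(category_adherence_score(real, synthetic))  # 0.6 -- 3 of the 5 synthetic values appear in {'A', 'B', 'C'}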

Usage

Recommended Usage: The Diagnostic Report applies this metric to applicable columns.

To manually apply this metric, access the single_column module and use the compute method.

from sdmetrics.single_column import CategoryAdherence

CategoryAdherence.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)
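For example, running the metric on a small, made-up column might look like the sketch below; the column values are hypothetical, and the call returns a single score between 0.0 (worst) and 1.0 (best) as described above.

import pandas as pd
from sdmetrics.single_column import CategoryAdherence

real_table = pd.DataFrame({'column_name': ['credit', 'debit', 'cash', 'credit']})
synthetic_table = pd.DataFrame({'column_name': ['cash', 'credit', 'voucher', 'debit']})

score = CategoryAdherence.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)
print(score)  # 0.75 -- 'voucher' never appears in the real column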

Parameters

  • (required) real_data: A pandas.Series object with the column of real data

  • (required) synthetic_data: A pandas.Series object with the column of synthetic data

FAQs

Is there an equivalent metric for continuous data?

For continuous datasets, many values are possible. Use the BoundaryAdherence metric to ensure they fall within the correct min/max bounds.

Does this metric measure quality or data coverage?

No. This metric is a measure of validity, as we generally consider discrete data to be valid only if it contains the correct category values.

  • Data quality refers to the frequency of each particular category value. To compare this, use the TVComplement metric.

  • Data coverage refers to the idea that the synthetic data should cover at least one instance of each category value. To measure this, use the CategoryCoverage metric.
