
CategoryCoverage

This metric measures whether a synthetic column covers all the possible categories that are present in a real column.

Data Compatibility

  • Categorical: This metric is meant for discrete, categorical data
  • Boolean: This metric is meant for boolean data
This metric ignores missing values.

Score

  • (best) 1.0: The synthetic column contains all the unique categories present in the real column
  • (worst) 0.0: The synthetic column contains none of the categories present in the real column
The plot below shows some fictitious real and synthetic data (black and green respectively) with CategoryCoverage=0.6.
The real data contains 5 unique categories: Science, Fine Arts, Arts, Business Administration and Other. However, the synthetic data only includes 3 of those categories, therefore the category coverage is 3/5.

How does it work?

This metric first computes the number of unique categories, c_r, that are present in the real column r. Then it computes how many of those categories are also present in the synthetic column s, denoted c_s. It returns the proportion of real categories that appear in the synthetic data.
score = \frac{c_s}{c_r}
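The computation can be sketched in plain Python. This is a minimal illustration of the formula, not the SDMetrics implementation: the helper names `_is_missing` and `category_coverage` are hypothetical, and returning NaN for a real column with no categories is an assumption. The example reuses the five categories from the Score section; which three of them appear in the synthetic column is not stated there, so the choice below is arbitrary.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing values (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def category_coverage(real, synthetic):
    """Proportion of the real column's unique categories found in the synthetic column."""
    # Missing values are dropped from both columns before comparing.
    real_categories = {v for v in real if not _is_missing(v)}
    synthetic_categories = {v for v in synthetic if not _is_missing(v)}
    if not real_categories:
        # Assumption: the score is undefined when the real column has no categories.
        return float("nan")
    c_r = len(real_categories)
    # Only categories that exist in the real column count toward c_s.
    c_s = len(real_categories & synthetic_categories)
    return c_s / c_r


real = ["Science", "Fine Arts", "Arts", "Business Administration", "Other", None, "Science"]
synthetic = ["Science", "Arts", "Other", "Science", None]
print(category_coverage(real, synthetic))  # 3 of 5 real categories are covered: 0.6
```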

Usage

To manually apply this metric, access the single_column module and use the compute method.
from sdmetrics.single_column import CategoryCoverage
CategoryCoverage.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)
Parameters
  • (required) real_data: A pandas.Series object with the column of real data
  • (required) synthetic_data: A pandas.Series object with the column of synthetic data

FAQs

Use the RangeCoverage metric with continuous data, for example numerical or datetime values that don't represent distinct categories.
Ideally, your synthetic column contains all the possible values that were present in the real data. Low coverage may be due to different factors:
  • You didn't create enough synthetic data: If you have a lot of unique categories or an uneven balance of categories, you may need to generate more synthetic data for all the categories to show up.
  • Your synthetic data model learned to ignore some categories: This is highly dependent on the model you used to generate the synthetic data. If your model uses a Generative Adversarial Network (GAN), you may be experiencing mode collapse [1], where the generator produces only a limited variety of outputs and drops some categories entirely. Check with your synthetic data provider to improve the modeling process.
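To get a feel for the first factor, the sketch below estimates how likely a rare category is to be absent from a synthetic sample purely by chance. It assumes each synthetic row is drawn independently and that the category appears with the same frequency as in the real data, which a real model may not satisfy. The function names are hypothetical and not part of SDMetrics.

```python
import math


def prob_category_missing(frequency, n_rows):
    """Chance that a category with the given frequency never appears in n_rows.

    Assumes independent rows and 0 < frequency < 1.
    """
    return (1.0 - frequency) ** n_rows


def rows_needed(frequency, confidence=0.95):
    """Smallest row count for which the category appears with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - frequency))


# A category that makes up 1% of the real data:
print(round(prob_category_missing(0.01, 100), 3))  # about 0.366
print(rows_needed(0.01))  # 299
```

Under these assumptions, a category with 1% frequency is missing from roughly one in three samples of 100 rows, so a low score on a small synthetic table does not necessarily point to a modeling problem.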
A high score indicates that the synthetic data has at least one example of each category. It does not indicate anything about the frequency of the categories. That is, the model may be over- or under-sampling certain categories.
To measure the similarity of the frequencies, use the TVComplement metric.
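The difference between coverage and frequency similarity can be illustrated with a small sketch. It assumes that TVComplement is one minus the total variation distance between the two category distributions. The functions below are hypothetical stand-ins written for illustration, not the SDMetrics implementations, and they assume columns without missing values.

```python
from collections import Counter


def frequencies(column):
    """Relative frequency of each category in a column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def coverage(real, synthetic):
    """Share of real categories that appear at least once in the synthetic column."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)


def tv_complement(real, synthetic):
    """One minus the total variation distance (assumed definition)."""
    real_freq, synth_freq = frequencies(real), frequencies(synthetic)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq.get(c, 0.0) - synth_freq.get(c, 0.0)) for c in categories
    )
    return 1.0 - distance


real = ["A"] * 50 + ["B"] * 50        # balanced real column
synthetic = ["A"] * 99 + ["B"] * 1    # heavily skewed synthetic column

print(coverage(real, synthetic))                 # 1.0: both categories appear
print(round(tv_complement(real, synthetic), 2))  # 0.51: frequencies differ a lot
```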
This metric will ignore new categories that appear in the synthetic data.
If your synthetic data contains new category values that were not in the real data, this may indicate that your data is not actually categorical. It may represent private or sensitive data that has been anonymized. In this case, ensure that you have listed the column as 'pii' in the metadata.
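Because the score is unaffected by new categories, it can help to check for them separately. The sketch below is one simple way to do so with a set difference. The function name `new_categories` is hypothetical and not part of SDMetrics, and the example assumes columns without missing values.

```python
def new_categories(real, synthetic):
    """Categories that appear in the synthetic column but not in the real one."""
    return sorted(set(synthetic) - set(real))


real = ["red", "green", "blue", "green"]
synthetic = ["red", "green", "blue", "purple"]

print(new_categories(real, synthetic))  # ['purple']
```

An empty result means every synthetic value also occurs in the real column.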

References