CategoryCoverage

This metric measures whether a synthetic column covers all the possible categories that are present in a real column.

Data Compatibility

  • Categorical: This metric is meant for discrete, categorical data

  • Boolean: This metric is meant for boolean data

This metric ignores missing values.

Score

  • (best) 1.0: The synthetic column contains all the unique categories present in the real column

  • (worst) 0.0: The synthetic column contains none of the categories present in the real column

The plot below shows some fictitious real and synthetic data (black and green respectively) with CategoryCoverage=0.6.

How does it work?

This metric first computes the number of unique categories present in the real column, c_r. Then it computes how many of those categories also appear in the synthetic column, c_s. It returns c_s / c_r, the proportion of real categories that are in the synthetic data.
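The computation described above can be sketched in plain Python. This is an illustrative sketch rather than the library's implementation: the helper name category_coverage, the treatment of None and NaN as missing, and the NaN result for an empty real column are all assumptions made for the example.

```python
import math


def category_coverage(real, synthetic):
    """Hypothetical helper: fraction of real categories found in synthetic."""

    def is_missing(value):
        # Assumption: None and float NaN both count as missing values.
        return value is None or (isinstance(value, float) and math.isnan(value))

    # Missing values are dropped before the categories are counted.
    real_categories = {v for v in real if not is_missing(v)}
    synthetic_categories = {v for v in synthetic if not is_missing(v)}

    if not real_categories:
        # Assumption: the score is undefined when there are no real categories.
        return float("nan")

    covered = real_categories & synthetic_categories
    return len(covered) / len(real_categories)


real = ["A", "B", "C", "D", "E", None]
synthetic = ["A", "A", "C", "E", None, "C"]

# 3 of the 5 real categories (A, C, E) appear in the synthetic column.
score = category_coverage(real, synthetic)
print(score)
```

Using sets makes the result independent of how often each category occurs, which matches the idea that only presence matters for this metric.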

Usage

To manually apply this metric, access the single_column module and use the compute method.

from sdmetrics.single_column import CategoryCoverage

CategoryCoverage.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)

Parameters

  • (required) real_data: A pandas.Series object with the column of real data

  • (required) synthetic_data: A pandas.Series object with the column of synthetic data

FAQs

Is there an equivalent metric for continuous data?

Use the RangeCoverage metric with continuous data, for example numerical or datetime values that don't represent distinct categories.

What does it mean if I see low category coverage?

Ideally, your synthetic column contains all the possible values that were present in the real data. Low coverage may be due to different factors:

  • You didn't create enough synthetic data: If you have a lot of unique categories or an uneven balance of categories, you may need to generate more synthetic data for all the categories to show up.

  • Your synthetic data model learned to ignore some categories: This is highly dependent on the model you used to generate the synthetic data. If your model uses a Generative Adversarial Network (GAN), you may be experiencing mode collapse [1], where the generator produces only a limited variety of outputs instead of covering the full data distribution. Check with your synthetic data provider to improve the modeling process.
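To get a rough feel for the first factor, the sketch below estimates how likely a rare category is to be absent from a synthetic sample. It assumes rows are drawn independently and that the model reproduces the category's real frequency exactly, which real generators may not do; the function name miss_probability and the 0.1% frequency are hypothetical choices for the example.

```python
def miss_probability(category_frequency, n_rows):
    """Chance that a category never appears in n_rows independent draws."""
    return (1.0 - category_frequency) ** n_rows


# A hypothetical rare category that makes up 0.1% of the real column.
rare_frequency = 0.001

for n_rows in (100, 1_000, 10_000):
    p_miss = miss_probability(rare_frequency, n_rows)
    print(f"{n_rows:>6} rows: P(category missing) = {p_miss:.4f}")
```

Under these assumptions, the chance of missing the category shrinks quickly as more rows are generated, so sampling more data is a cheap first check before suspecting the model.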

Does high coverage mean that my synthetic data is similar to the real data?

A high score means that the synthetic data has at least 1 example of each category. It does not indicate anything about the frequency of the categories. That is, the model may be over- or under-sampling certain categories.

To measure the similarity of the frequencies, use the TVComplement metric.
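As a small illustration, the sketch below builds a synthetic column that covers every real category but with very different frequencies. The frequency comparison uses one minus the total variation distance as a stand-in for a frequency-based score; it is written by hand in plain Python and is an assumption for this example, not a call into the library.

```python
from collections import Counter

real = ["A"] * 50 + ["B"] * 50
synthetic = ["A"] * 99 + ["B"] * 1

# Coverage only asks whether each real category appears at least once.
coverage = len(set(real) & set(synthetic)) / len(set(real))


def frequencies(values):
    """Relative frequency of each category in a list of values."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


real_freq = frequencies(real)
synthetic_freq = frequencies(synthetic)

# Total variation distance: half the sum of absolute frequency differences.
all_categories = set(real_freq) | set(synthetic_freq)
tv_distance = 0.5 * sum(
    abs(real_freq.get(c, 0.0) - synthetic_freq.get(c, 0.0)) for c in all_categories
)
frequency_similarity = 1.0 - tv_distance

print(coverage)               # full coverage
print(frequency_similarity)   # roughly 0.51, far from a perfect match
```

Here the coverage is perfect even though category B is almost absent from the synthetic column, which is why a frequency-based metric is needed alongside it.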

What if there is a new category in the synthetic data?

This metric will ignore new categories that appear in the synthetic data.

If your synthetic data contains new category values that were not in the real data, this may indicate that your data is not actually categorical. It may represent private or sensitive data that has been anonymized. In this case, ensure that you have listed the column as 'pii' in the metadata.
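The effect of new categories can be checked with a minimal plain-Python sketch. It mirrors the proportion described in this page rather than calling the library, and the color values are made up for the example.

```python
real = ["red", "green", "blue"]
synthetic = ["red", "green", "blue", "purple", "orange"]

real_categories = set(real)
synthetic_categories = set(synthetic)

# Only real categories enter the score, so extra synthetic values are ignored.
score = len(real_categories & synthetic_categories) / len(real_categories)

# Listing the unseen values separately can help spot non-categorical columns.
new_categories = synthetic_categories - real_categories

print(score)
print(sorted(new_categories))
```

The score stays at its maximum even though two synthetic values never occur in the real data, so it can be worth inspecting new categories separately.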

References

[1] https://en.wikipedia.org/wiki/Generative_adversarial_network
