CardinalityBoundaryAdherence

When two tables are connected, cardinality refers to the number of child rows that are connected to each parent row. This metric measures whether the cardinality of the synthetic data stays within the min/max values determined by the real data.

Data Compatibility

  • Foreign Key: This metric is meant for foreign keys.

  • Primary Key: This metric validates that the foreign key values are found in the primary key.

This metric ignores missing values in the foreign key.

Score

  • (best) 1.0: The cardinality of the synthetic data is always in the min/max bounds as determined by the real data.

  • (worst) 0.0: The cardinality of the synthetic data is never within the min/max bounds as determined by the real data.

The example below shows a distribution of cardinality values for real and synthetic data (black and green, respectively). The real data has a min cardinality of 0 and a max of 4. Since the synthetic data is contained within these bounds, the score is 1.0.

How does it work?

In a multi-table setup, there is a parent table and a child table. The parent contains a primary key that uniquely identifies every row, while the child contains a foreign key that refers to a parent row. Foreign key values may repeat, as multiple children can reference the same parent.

This metric computes the cardinality [1] of each parent row. That is, it counts the number of children that each parent row has, so that each parent row is associated with an integer ≥ 0. This yields a set of cardinality values for both the real data, r, and the synthetic data, s. The score is the proportion of values in s that fall within the min/max boundary of r.

$$score = \frac{|\{ x \in s : x \ge \min(r) \text{ and } x \le \max(r) \}|}{|s|}$$
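The computation described above can be sketched in plain Python. This is an illustrative sketch only, not the SDMetrics implementation: the helper names `cardinalities` and `boundary_adherence` and the toy data are assumptions made for this example. It assumes each dataset is given as a (primary key, foreign key) pair of sequences and that missing foreign keys appear as `None` or NaN, which are skipped as noted earlier.

```python
from collections import Counter
import math


def cardinalities(primary_keys, foreign_keys):
    """Return the number of children for each parent key (0 if none)."""
    # Skip missing foreign keys (None or NaN) before counting.
    counts = Counter(
        fk for fk in foreign_keys
        if fk is not None and not (isinstance(fk, float) and math.isnan(fk))
    )
    return [counts.get(pk, 0) for pk in primary_keys]


def boundary_adherence(real, synthetic):
    """Proportion of synthetic cardinalities inside [min(r), max(r)]."""
    r = cardinalities(*real)
    s = cardinalities(*synthetic)
    low, high = min(r), max(r)
    in_bounds = sum(1 for value in s if low <= value <= high)
    return in_bounds / len(s)


# Toy data, made up for illustration: (primary keys, foreign keys).
real = ([1, 2, 3], [1, 1, 2, None])                        # r = [2, 1, 0]
synthetic = ([10, 20, 30, 40], [10, 10, 10, 20, 30, 30])   # s = [3, 1, 2, 0]

score = boundary_adherence(real, synthetic)
print(score)  # 0.75
```

Here the real cardinalities range from 0 to 2. Three of the four synthetic parents fall inside that range, and the parent with 3 children falls outside it, so the score is 3/4.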

Usage

Recommended Usage: The Diagnostic Report applies this metric to applicable columns.

To manually apply this metric, access the column_pairs module and use the compute method.

from sdmetrics.column_pairs import CardinalityBoundaryAdherence

# Each argument is a tuple of (primary key Series, foreign key Series).
CardinalityBoundaryAdherence.compute(
    real_data=(real_table['primary_key'], real_table['foreign_key']),
    synthetic_data=(synthetic_table['primary_key'], synthetic_table['foreign_key'])
)

Parameters

  • (required) real_data: A tuple of 2 pandas.Series objects. The first represents the primary key of the real data and the second represents the foreign key.

  • (required) synthetic_data: A tuple of 2 pandas.Series objects. The first represents the primary key of the synthetic data and the second represents the foreign key.

References

[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
