CardinalityBoundaryAdherence

When two tables are connected, the cardinality refers to the number of child rows that reference a given parent row. This metric measures whether the cardinality of the synthetic data stays within the min/max values observed in the real data.

Data Compatibility

  • Foreign Key : This metric is meant for foreign keys

  • Primary Key : This metric validates that the foreign key values are found in the primary key

This metric ignores missing values in the foreign key.

Score

  • (best) 1.0: The cardinality of the synthetic data is always in the min/max bounds as determined by the real data.

  • (worst) 0.0: The cardinality of the synthetic data is never within the min/max bounds as determined by the real data.

The example below shows a distribution of cardinality values for real and synthetic data (black and green, respectively). The real data has a min cardinality of 0 and a max of 4. Since the synthetic data is contained within these bounds, the score is 1.0.

How does it work?

In a multi-table setup, there is a parent table and a child table. The parent contains a primary key that uniquely identifies every row, while the child contains a foreign key that refers to a parent row. Foreign key values may repeat, as multiple children can reference the same parent.

The parent table contains primary keys, while the child table has foreign keys that refer to them. Each parent row can have a different number of children based on the references. For example, User_00 has 1 child row, User_01 has 2, User_02 has 0, and so on.
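The counting described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper name `cardinalities` and the toy key values are hypothetical, and missing foreign keys are assumed to be represented as `None`.

```python
from collections import Counter


def cardinalities(primary_keys, foreign_keys):
    """Map each primary key to the number of child rows that reference it.

    Hypothetical helper for illustration; not part of sdmetrics.
    """
    # Count every non-missing foreign key value (missing values are ignored).
    counts = Counter(fk for fk in foreign_keys if fk is not None)
    # Parents that are never referenced get a cardinality of 0.
    return {pk: counts.get(pk, 0) for pk in primary_keys}


# Toy data mirroring the example above (assumed values).
parent_ids = ["User_00", "User_01", "User_02"]
child_fks = ["User_00", "User_01", "User_01"]

result = cardinalities(parent_ids, child_fks)
print(result)  # {'User_00': 1, 'User_01': 2, 'User_02': 0}
```

Iterating over the primary keys, rather than only over the foreign keys that occur, is what lets parents with no children show up with a cardinality of 0.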

This metric computes the cardinality [1] of each parent row. That is, it counts the number of children that each parent row has, so that each parent row is associated with an integer ≥ 0. This yields a collection of cardinality values for both the real data, r, and the synthetic data, s. The score is the proportion of values in s that fall within the min/max boundary of r.

$$score = \frac{|\{ s_i \in s : s_i \ge \min(r) \text{ and } s_i \le \max(r) \}|}{|s|}$$
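As a rough, library-free sketch of the formula above, the score can be computed from two (primary key, foreign key) pairs. The function names `cardinality_values` and `boundary_adherence` and the toy data are assumptions made for illustration. The sketch also assumes missing foreign keys are `None` and that both parent tables are non-empty. It does not reproduce how sdmetrics implements the metric internally.

```python
from collections import Counter


def cardinality_values(primary_keys, foreign_keys):
    """Return one cardinality (child count) per parent row."""
    counts = Counter(fk for fk in foreign_keys if fk is not None)
    return [counts.get(pk, 0) for pk in primary_keys]


def boundary_adherence(real, synthetic):
    """Proportion of synthetic cardinalities inside [min(r), max(r)].

    `real` and `synthetic` are (primary_keys, foreign_keys) pairs.
    Hypothetical helper; assumes both parent tables are non-empty.
    """
    r = cardinality_values(*real)
    s = cardinality_values(*synthetic)
    low, high = min(r), max(r)
    in_bounds = sum(1 for value in s if low <= value <= high)
    return in_bounds / len(s)


# Real data: cardinalities are [1, 2, 0], so the bounds are 0 and 2.
real = (["A", "B", "C"], ["A", "B", "B"])
# Synthetic data: cardinalities are [3, 1, 2, 0]; only the 3 is out of bounds.
synthetic = (["X", "Y", "Z", "W"], ["X", "X", "X", "Y", "Z", "Z"])

score = boundary_adherence(real, synthetic)
print(score)  # 0.75
```

Three of the four synthetic parents have a cardinality between 0 and 2, which gives 3/4 = 0.75.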

Usage

To manually apply this metric, access the column_pairs module and use the compute method.

from sdmetrics.column_pairs import CardinalityBoundaryAdherence

CardinalityBoundaryAdherence.compute(
    real_data=(real_table['primary_key'], real_table['foreign_key']),
    synthetic_data=(synthetic_table['primary_key'], synthetic_table['foreign_key'])
)

Parameters

  • (required) real_data: A tuple of 2 pandas.Series objects. The first represents the primary key of the real data and the second represents the foreign key.

  • (required) synthetic_data: A tuple of 2 pandas.Series objects. The first represents the primary key of the synthetic data and the second represents the foreign key.

References

[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
