CardinalityBoundaryAdherence
If there are two connected tables, the cardinality refers to the number of child rows connected to each parent row. This metric measures whether the cardinality of the synthetic data stays within the min/max bounds observed in the real data.
Foreign Key: This metric is meant for foreign keys.
Primary Key: This metric validates that the foreign key values are found in the primary key.
This metric ignores missing values in the foreign key.
(best) 1.0: The cardinality of the synthetic data is always in the min/max bounds as determined by the real data.
(worst) 0.0: The cardinality of the synthetic data is never within the min/max bounds as determined by the real data.
The example below shows a distribution of cardinality values for real and synthetic data (black and green, respectively). The real data has a min cardinality of 0 and a max of 4. Since the synthetic data is contained within these bounds, the score is 1.0.
In a multi-table setup, there is a parent table and a child table. The parent contains a primary key that uniquely identifies every row, while the child contains a foreign key that refers to a parent row. The foreign keys may repeat, as multiple children can reference the same parent.
This metric computes the cardinality [1] of each parent row. That is, it computes the number of children that each parent row has, so that each parent row is associated with an integer ≥ 0. This yields a set of values for both the real data, r, and the synthetic data, s. The score is the proportion of values in s that fall within the min/max bounds of r.
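The computation described above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the helper names cardinalities and boundary_adherence are hypothetical, the toy keys are made up, missing foreign keys are represented as None, and the synthetic parent table is assumed to be non-empty.

```python
from collections import Counter


def cardinalities(primary_keys, foreign_keys):
    """Return the number of child rows for each parent row, in parent order."""
    # Missing foreign keys (None in this sketch) are ignored.
    counts = Counter(fk for fk in foreign_keys if fk is not None)
    # Parents that no child references get a cardinality of 0.
    return [counts.get(pk, 0) for pk in primary_keys]


def boundary_adherence(real, synthetic):
    """Score two (primary_keys, foreign_keys) pairs against the real bounds."""
    r = cardinalities(*real)
    s = cardinalities(*synthetic)
    low, high = min(r), max(r)
    # Proportion of synthetic parent rows whose cardinality is within bounds.
    return sum(low <= c <= high for c in s) / len(s)


# Hypothetical toy data: (primary keys, foreign keys).
real = (["User_00", "User_01", "User_02"],
        ["User_00", "User_01", "User_01", None])
synthetic = (["A", "B", "C", "D"],
             ["A", "B", "B", "B", "C", "C"])

print(cardinalities(*real))                 # [1, 2, 0]
print(cardinalities(*synthetic))            # [1, 3, 2, 0]
print(boundary_adherence(real, synthetic))  # 0.75
```

Here the real cardinalities range from 0 to 2, and one of the four synthetic parents has 3 children, so three of four synthetic values are in bounds.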
Recommended Usage: The Diagnostic Report applies this metric to applicable columns.
To manually apply this metric, access the column_pairs module and use the compute method.
Parameters
(required) real_data: A tuple of 2 pandas.Series objects. The first represents the primary key of the real data and the second represents the foreign key.
(required) synthetic_data: A tuple of 2 pandas.Series objects. The first represents the primary key of the synthetic data and the second represents the foreign key.
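A minimal usage sketch, assuming the metric class is importable from the column_pairs module under the name shown in this page's title. The toy tables and the user_id column name are hypothetical. The import is guarded so the snippet still runs when sdmetrics is not installed, in which case score stays None.

```python
import pandas as pd

# Hypothetical toy tables; the column name "user_id" is illustrative only.
real_parent = pd.DataFrame({"user_id": ["User_00", "User_01", "User_02"]})
real_child = pd.DataFrame({"user_id": ["User_00", "User_01", "User_01"]})
synthetic_parent = pd.DataFrame({"user_id": ["User_A", "User_B", "User_C"]})
synthetic_child = pd.DataFrame({"user_id": ["User_A", "User_A", "User_C"]})

# Each argument is a (primary key, foreign key) tuple of pandas.Series.
real_data = (real_parent["user_id"], real_child["user_id"])
synthetic_data = (synthetic_parent["user_id"], synthetic_child["user_id"])

try:
    # Assumed import path, based on the module and class named on this page.
    from sdmetrics.column_pairs import CardinalityBoundaryAdherence
except ImportError:
    # sdmetrics is not available; the inputs above can still be inspected.
    score = None
else:
    score = CardinalityBoundaryAdherence.compute(
        real_data=real_data,
        synthetic_data=synthetic_data,
    )

print(score)
```

In this toy data the real cardinalities are 1, 2 and 0, and the synthetic cardinalities are 2, 0 and 1, so every synthetic value lies within the real bounds of 0 to 2.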
[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
User_00 has 1 child row, User_01 has 2, User_02 has 0, and so on.