ReferentialIntegrity

This metric measures the integrity of a connection between a foreign key and primary key. Every value in the foreign key column must be found in the primary key.

Data Compatibility

  • Foreign Key : This metric is meant for foreign keys

  • Primary Key : This metric validates that the foreign key values are found in the primary key

This metric counts missing values as valid foreign keys.

Score

  • (best) 1.0: All the foreign key values are found in the primary key

  • (worst) 0.0: None of the foreign key values are found in the primary key. This indicates that the dataset has orphan children, which is invalid in most database systems.

How does it work?

In a multi-table setup, there is a parent table and a child table. The parent contains a primary key that uniquely identifies every row, while the child contains a foreign key that refers to a parent row. Foreign key values may repeat, as multiple children can reference the same parent.

The parent table contains primary keys while the child table has foreign keys that refer to them.

If s represents the synthetic data, then this metric identifies whether the foreign key values (FK) in s match a value in the primary key (PK) of s. The score is the proportion of foreign key values that are found in the primary key column.

score = \frac{| s_{FK}, s_{FK} \in s_{PK} |}{| s_{FK} |}

Note that if a foreign key value is missing, this metric counts it as valid, meaning that it is included in the numerator.
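The calculation described above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the library's implementation: the helper names `is_missing` and `referential_integrity_score` are hypothetical, and raising an error for an empty foreign key column is an assumption, since this page does not say what happens in that case.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def referential_integrity_score(primary_keys, foreign_keys):
    """Proportion of foreign key values that are missing or found in the primary key."""
    if len(foreign_keys) == 0:
        # Assumption: the behavior for an empty column is not documented here.
        raise ValueError("The foreign key column is empty.")
    pk_values = set(primary_keys)
    # Missing foreign keys count as valid, so they are added to the numerator.
    valid = sum(1 for fk in foreign_keys if is_missing(fk) or fk in pk_values)
    return valid / len(foreign_keys)


parent_ids = [1, 2, 3]
child_parent_ids = [1, 1, 2, None, 7]  # 7 is an orphan, None is missing

score = referential_integrity_score(parent_ids, child_parent_ids)
print(score)  # 0.8 (4 of the 5 foreign key values are valid)
```

Repeated foreign key values are each counted separately, which matches the idea that multiple children can reference the same parent.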

If multiple columns form the primary key (a composite key), then all of the values in the foreign key must match up exactly with all of the columns in the primary key for it to count as a match.
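To make the composite key rule concrete, here is a minimal sketch that compares whole rows rather than individual columns. The function name `composite_key_score` and the example data are hypothetical. The sketch does not treat missing values specially, because this page does not describe how partially missing composite keys are handled.

```python
def composite_key_score(primary_rows, foreign_rows):
    """Proportion of foreign key rows whose full combination of values is a primary key."""
    pk_rows = {tuple(row) for row in primary_rows}
    # A row matches only if every column agrees with the same primary key row.
    matches = sum(1 for row in foreign_rows if tuple(row) in pk_rows)
    return matches / len(foreign_rows)


# Composite primary key made of (country, local_id)
primary_rows = [("US", "001"), ("US", "002"), ("CA", "001")]

# ("CA", "002") is not a match: "CA" and "002" each appear in the primary key,
# but never together in the same row.
foreign_rows = [("US", "001"), ("CA", "001"), ("CA", "002"), ("US", "002")]

composite_score = composite_key_score(primary_rows, foreign_rows)
print(composite_score)  # 0.75
```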

Usage


To manually apply this metric, access the column_pairs module and use the compute method.

Parameters

  • (required) real_data: A tuple of 2 pandas.DataFrame objects. The first represents the primary key of the real data and the second represents the foreign key. For a composite key, provide multiple columns in each pandas.DataFrame object.

  • (required) synthetic_data: A tuple of 2 pandas.DataFrame objects. The first represents the primary key of the synthetic data and the second represents the foreign key. For a composite key, provide multiple columns in each pandas.DataFrame object.
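As a rough sketch of how these parameters might be assembled, the example below builds the two tuples from small parent and child tables. The table contents, the column names, and the helpers `key_pair` and `run_metric` are hypothetical. The import path and class name inside `run_metric` are assumptions inferred from the module and method named above, so check them against your installed version before relying on them.

```python
import pandas as pd


def key_pair(parent, child, pk_columns, fk_columns):
    """Return (primary key DataFrame, foreign key DataFrame) for one relationship."""
    # Selecting with a list of column names keeps each result a DataFrame,
    # which also works for composite keys with several columns.
    return parent[pk_columns], child[fk_columns]


real_parent = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["a", "b", "a"]})
real_child = pd.DataFrame({"session_id": [10, 11, 12], "user_id": [1, 1, 3]})

synthetic_parent = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["b", "b", "a"]})
synthetic_child = pd.DataFrame({"session_id": [20, 21, 22], "user_id": [2, 3, 3]})

real_data = key_pair(real_parent, real_child, ["user_id"], ["user_id"])
synthetic_data = key_pair(synthetic_parent, synthetic_child, ["user_id"], ["user_id"])


def run_metric(real_data, synthetic_data):
    """Call the metric's compute method on the prepared tuples."""
    # Assumption: the metric is exposed under this name in the column_pairs module.
    from sdmetrics.column_pairs import ReferentialIntegrity

    return ReferentialIntegrity.compute(
        real_data=real_data,
        synthetic_data=synthetic_data,
    )


# With the library installed:
# score = run_metric(real_data, synthetic_data)
```

Keeping the tuple construction in one helper makes it harder to swap the primary key and foreign key positions by accident.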

FAQs

Should the score always be 1?

If you are running this metric on a connection between a primary key and a foreign key, then the score should always be 1. Foreign keys are expected to always refer to a primary key in order to be valid for most database systems.

Does this metric use real data?

This metric checks whether the real data also has referential integrity and alerts you if this is not the case. However, the final score is based only on the synthetic data.
