ReferentialIntegrity

This metric measures the integrity of a connection between a foreign key and primary key. Every value in the foreign key column must be found in the primary key.

Data Compatibility

  • Foreign Key : This metric is meant for foreign keys

  • Primary Key : This metric validates that the foreign key values are found in the primary key

This metric counts missing values as valid foreign keys.

Score

  • (best) 1.0: All the foreign key values are found in the primary key

  • (worst) 0.0: None of the foreign key values are found in the primary key. This indicates that the dataset has orphan children, which is invalid in most database systems.

How does it work?

In a multi-table setup, there is a parent table and a child table. The parent contains a primary key that uniquely identifies every row, while the child contains a foreign key that refers to a parent row. The foreign keys may repeat, as multiple children can reference the same parent.
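The parent and child relationship can be illustrated with a small pandas sketch. The table and column names here (`parent`, `child`, `user_id`, `session_id`) are hypothetical and chosen only for illustration; this is not how the metric is implemented internally. The example has no missing foreign key values.

```python
import pandas as pd

# Hypothetical parent table: 'user_id' is the primary key.
parent = pd.DataFrame({"user_id": [1, 2, 3], "name": ["a", "b", "c"]})

# Hypothetical child table: 'user_id' is a foreign key that may repeat.
child = pd.DataFrame({"session_id": [10, 11, 12, 13], "user_id": [1, 1, 3, 5]})

# A primary key must uniquely identify every parent row.
pk_is_unique = parent["user_id"].is_unique

# Orphan children are rows whose foreign key has no matching parent.
orphans = child[~child["user_id"].isin(parent["user_id"])]
print(pk_is_unique)
print(orphans)
```

In this toy data, the last child row refers to a `user_id` that does not exist in the parent table, so it is the only orphan.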

If s represents the synthetic data, then this metric identifies whether the foreign key values (FK) in s match a value in the primary key (PK) of s. The score is the proportion of foreign key values that are found in the primary key column.

score = \frac{|s_{FK}, s_{FK} \in s_{PK} | }{| s_{FK} |}

Note that if a foreign key value is missing, this metric counts it as valid, meaning that it is included in the numerator.
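As a rough illustration of the formula, the score can be sketched directly in pandas. The function name `referential_integrity_score` is hypothetical, and returning NaN for an empty foreign key column is an assumption of this sketch; the library's own implementation may differ in its details.

```python
import pandas as pd


def referential_integrity_score(primary_key: pd.Series, foreign_key: pd.Series) -> float:
    """Illustrative sketch: proportion of foreign key values found in the primary key."""
    if len(foreign_key) == 0:
        # Assumption of this sketch: the score is undefined for an empty column.
        return float("nan")
    # True where the foreign key value appears in the primary key column.
    found = foreign_key.isin(primary_key)
    # Missing foreign key values are counted as valid, as described above.
    missing = foreign_key.isna()
    return float((found | missing).mean())


# Toy data: one foreign key value (4) has no parent, and one value is missing.
pk = pd.Series([1, 2, 3])
fk = pd.Series([1, 1, 2, 4, None])
score = referential_integrity_score(pk, fk)
print(score)
```

Here four of the five foreign key values count as valid (three matches plus one missing value), so the sketch yields 4/5.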

Usage

Recommended Usage: The Diagnostic Report applies this metric to applicable columns.

To manually apply this metric, access the column_pairs module and use the compute method.

from sdmetrics.column_pairs import ReferentialIntegrity

ReferentialIntegrity.compute(
    real_data=(real_table['primary_key'], real_table['foreign_key']),
    synthetic_data=(synthetic_table['primary_key'], synthetic_table['foreign_key'])
)

Parameters

  • (required) real_data: A tuple of 2 pandas.Series objects. The first represents the primary key of the real data and the second represents the foreign key.

  • (required) synthetic_data: A tuple of 2 pandas.Series objects. The first represents the primary key of the synthetic data and the second represents the foreign key.

FAQs

Should the score always be 1?

If you are running this score on a connection between primary key and foreign key, then the score should always be 1. Foreign keys are expected to always refer to a primary key in order to be valid for most database systems.

Does this metric use real data?

This metric checks whether the real data also has referential integrity and alerts you if this is not the case. However, the final score is based only on the synthetic data.
