CardinalityShapeSimilarity
If you have multiple connected tables, this metric measures whether the cardinality of the parent table is the same in the real and synthetic datasets. The cardinality is defined as the number of child rows for each parent.
ID: This metric is meant to be used on ID columns (primary and foreign keys). Primary key IDs must be unique, while foreign key IDs can repeat.
ID columns cannot have any missing values.
(best) 1.0: The cardinality values are the same in the real and synthetic data
(worst) 0.0: The cardinality values are as different as can be
The example below shows a distribution of cardinality values for real and synthetic data (black and green, respectively). The CardinalityShapeSimilarity score is 0.85, indicating that the cardinalities are mostly similar with some key differences.
In a multi table setup, there is a parent and child table. The parent contains a primary key that uniquely identifies every row while the child contains a foreign key that refers to a parent row. The foreign keys may repeat, as multiple children can reference the same parent.
This metric computes the cardinality [1] of each parent row. That is, it counts the number of children that each parent row has, so that each parent row is associated with an integer ≥ 0.
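The counting step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the library's implementation: the helper name `cardinalities` and the sample keys are hypothetical, and the tables are reduced to plain lists of key values.

```python
from collections import Counter


def cardinalities(parent_keys, child_foreign_keys):
    """Return the number of child rows for each parent key, in parent order.

    Hypothetical helper for illustration; not part of any library.
    """
    counts = Counter(child_foreign_keys)
    # A Counter returns 0 for keys it has never seen,
    # so parents without children get a cardinality of 0.
    return [counts[key] for key in parent_keys]


# Made-up example data: primary keys of the parent table
# and the foreign key column of the child table.
parent_keys = ["User_00", "User_01", "User_02"]
child_foreign_keys = ["User_00", "User_01", "User_01"]

result = cardinalities(parent_keys, child_foreign_keys)
print(result)  # [1, 2, 0]
```

Running the same count on the real tables and on the synthetic tables gives the two lists of integers that are compared afterwards.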
This yields a numerical distribution for both the real and synthetic data. The CardinalityShapeSimilarity metric computes and returns the KSComplement score of these distributions.
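As a rough sketch of the comparison step, the snippet below assumes that KSComplement is one minus the two-sample Kolmogorov-Smirnov statistic, that is, one minus the largest gap between the two empirical cumulative distribution functions. The function name `ks_complement` and the sample cardinalities are hypothetical, and the library's own implementation may differ in detail.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    Assumption: this mirrors the idea behind KSComplement;
    it is not the library's code.
    """
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    max_gap = 0.0
    # The empirical CDFs only change at observed values,
    # so it is enough to compare them at those points.
    for x in set(real_sorted) | set(synth_sorted):
        real_cdf = bisect_right(real_sorted, x) / len(real_sorted)
        synth_cdf = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(real_cdf - synth_cdf))
    return 1.0 - max_gap


# Made-up cardinalities for four parent rows in each dataset.
real_cardinalities = [1, 2, 0, 3]
synthetic_cardinalities = [1, 1, 0, 3]

score = ks_complement(real_cardinalities, synthetic_cardinalities)
print(score)  # 0.75
```

Identical cardinality distributions give a score of 1.0, and distributions with no overlap give 0.0, which matches the score range described above.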
Access this metric from the multi_table module and use the compute_breakdown method.
Parameters
(required) real_data: A dictionary mapping table names to pandas.DataFrame objects that contain the real data
(required) synthetic_data: A dictionary mapping the same table names to pandas.DataFrame objects that contain the synthetic data
(required) metadata: A metadata dictionary describing the relationships between the tables (see Multi Table Metadata)
Returns: A dictionary that maps each relationship to its CardinalityShapeSimilarity score.
[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
User_00 has 1 child row, User_01 has 2, User_02 has 0 and so on.