
CardinalityShapeSimilarity

For multi-table datasets with connected tables, this metric measures whether the cardinality of a parent table is the same in the real and synthetic datasets. Cardinality is defined as the number of child rows for each parent row.

Data Compatibility

  • ID: This metric is meant to be used on ID columns (primary and foreign keys). Primary key IDs must be unique, while foreign key IDs can repeat.
ID columns cannot have any missing values.
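
The requirements above can be checked before computing the metric. The following is a minimal sketch using pandas; the helper name check_id_columns and the example column names are hypothetical and are not part of SDMetrics.

```python
import pandas as pd


def check_id_columns(parent, child, primary_key, foreign_key):
    """Return a list of problems found in the ID columns (empty if none).

    Hypothetical helper for illustration; not an SDMetrics function.
    """
    problems = []
    if parent[primary_key].isna().any():
        problems.append("primary key has missing values")
    if parent[primary_key].duplicated().any():
        problems.append("primary key has repeated values")
    if child[foreign_key].isna().any():
        problems.append("foreign key has missing values")
    return problems


# Made-up example tables: the parent repeats an ID and the child has a missing ID.
bad_parent = pd.DataFrame({"user_id": ["User_00", "User_01", "User_01"]})
bad_child = pd.DataFrame({"user_id": ["User_00", None, "User_01"]})
good_parent = pd.DataFrame({"user_id": ["User_00", "User_01", "User_02"]})
good_child = pd.DataFrame({"user_id": ["User_00", "User_01", "User_01"]})

bad_problems = check_id_columns(bad_parent, bad_child, "user_id", "user_id")
good_problems = check_id_columns(good_parent, good_child, "user_id", "user_id")
```

Repeated foreign key values are not reported, since several child rows may reference the same parent.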

Score

(best) 1.0: The cardinality values are the same in the real and synthetic data
(worst) 0.0: The cardinality values are as different as can be
The example below shows a distribution of cardinality values for real and synthetic data (black and green, respectively). The CardinalityShapeSimilarity score is 0.85, indicating that the cardinalities are mostly similar with some key differences.
Figure: distribution of the cardinality for the real and synthetic data. In the real data, the vast majority of parent rows have a cardinality of 1. In the synthetic data, the cardinality is more evenly distributed in the [0,3] range.

How does it work?

In a multi-table setup, there is a parent table and a child table. The parent contains a primary key that uniquely identifies every row, while the child contains a foreign key that refers to a parent row. The foreign keys may repeat, as multiple children can reference the same parent.
Figure: the parent table contains primary keys, while the child table has foreign keys that refer to them. Each parent row can have a different number of children based on the references. For example, User_00 has 1 child row, User_01 has 2, User_02 has 0, and so on.
This metric computes the cardinality [1] of each parent row. That is, it counts the number of children that each parent row has, so that each parent row is associated with an integer ≥ 0.
This yields a numerical distribution for both the real and synthetic data. The CardinalityShapeSimilarity metric computes and returns the KSComplement score of these distributions.
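
The two steps described above can be sketched as follows. This is an illustrative sketch and not the SDMetrics implementation: it assumes the KSComplement score is 1 minus the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). The function names, column names, and example tables are hypothetical.

```python
import numpy as np
import pandas as pd


def cardinality_per_parent(parent, child, primary_key, foreign_key):
    """Count child rows for each parent row; parents without children get 0."""
    counts = child[foreign_key].value_counts()
    return parent[primary_key].map(counts).fillna(0).astype(int)


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the largest gap between the two empirical CDFs."""
    real_sorted = np.sort(np.asarray(real_values))
    synth_sorted = np.sort(np.asarray(synthetic_values))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.union1d(real_sorted, synth_sorted)
    real_cdf = np.searchsorted(real_sorted, grid, side="right") / len(real_sorted)
    synth_cdf = np.searchsorted(synth_sorted, grid, side="right") / len(synth_sorted)
    return 1.0 - float(np.max(np.abs(real_cdf - synth_cdf)))


# Real data: User_00 has 1 child row, User_01 has 2, User_02 has 0.
real_users = pd.DataFrame({"user_id": ["User_00", "User_01", "User_02"]})
real_sessions = pd.DataFrame({"user_id": ["User_00", "User_01", "User_01"]})

# Made-up synthetic data: every user has exactly 1 child row.
synthetic_users = pd.DataFrame({"user_id": ["User_00", "User_01", "User_02"]})
synthetic_sessions = pd.DataFrame({"user_id": ["User_00", "User_01", "User_02"]})

real_cardinality = cardinality_per_parent(real_users, real_sessions, "user_id", "user_id")
synthetic_cardinality = cardinality_per_parent(
    synthetic_users, synthetic_sessions, "user_id", "user_id"
)

score = ks_complement(real_cardinality, synthetic_cardinality)
```

In this toy example the real cardinalities are 1, 2 and 0, while the synthetic ones are all 1. The empirical CDFs differ by at most 1/3, so the sketch gives a score of about 0.667. Identical cardinality distributions give 1.0.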

Usage

Access this metric from the multi_table module and use the compute_breakdown method.
from sdmetrics.multi_table import CardinalityShapeSimilarity

# Both dictionaries must use the same table names as keys.
CardinalityShapeSimilarity.compute_breakdown(
    real_data={
        'users': real_user_table,
        'sessions': real_sessions_table,
        'transactions': real_transactions_table
    },
    synthetic_data={
        'users': synthetic_user_table,
        'sessions': synthetic_sessions_table,
        'transactions': synthetic_transactions_table
    },
    metadata=multi_table_metadata_dict
)
{
('users', 'sessions'): 0.78891,
('sessions', 'transactions'): 0.588211
}
Parameters
  • (required) real_data: A dictionary mapping table names to pandas.DataFrame objects that contain the real data
  • (required) synthetic_data: A dictionary mapping the same table names to pandas.DataFrame objects that contain the synthetic data
  • (required) metadata: A metadata dictionary describing the relationships between the tables (see Multi Table Metadata)
Returns: A dictionary that maps each relationship, given as a (parent table, child table) tuple, to its CardinalityShapeSimilarity score.
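
As an illustration of working with the result, the sketch below picks out the relationship with the lowest score. It assumes the flat dictionary shape from the example output above; the scores are copied from that example, and the variable names are hypothetical.

```python
# Example result, copied from the sample output in this section.
scores = {
    ('users', 'sessions'): 0.78891,
    ('sessions', 'transactions'): 0.588211,
}

# Relationship whose synthetic cardinalities differ most from the real ones.
worst_relationship = min(scores, key=scores.get)

# A simple overall summary across relationships.
average_score = sum(scores.values()) / len(scores)

# List relationships from lowest to highest score.
for (parent, child), score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{parent} -> {child}: {score:.3f}")
```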

References