CardinalityShapeSimilarity

For multi-table datasets with connected tables, this metric measures whether the cardinality of the parent table is the same in the real and synthetic datasets. Cardinality is defined as the number of child rows for each parent row.

Data Compatibility

  • ID: This metric is meant to be used on ID columns (primary and foreign keys). Primary key IDs must be unique, while foreign key IDs can repeat.

ID columns cannot have any missing values.
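The requirements above can be checked before computing the metric. The following is a minimal sketch using pandas; the helper `check_id_columns` and the toy table and column names are hypothetical and not part of SDMetrics.

```python
import pandas as pd


def check_id_columns(parent, child, primary_key, foreign_key):
    """Return a list of problems found in the ID columns (empty if none)."""
    problems = []
    if parent[primary_key].isna().any():
        problems.append('primary key has missing values')
    if parent[primary_key].duplicated().any():
        problems.append('primary key has duplicate values')
    if child[foreign_key].isna().any():
        problems.append('foreign key has missing values')
    return problems


# Hypothetical toy tables: 'user_id' is the primary key of users
# and the foreign key of sessions.
users = pd.DataFrame({'user_id': [1, 2, 3]})
sessions = pd.DataFrame({'session_id': [10, 11, 12], 'user_id': [1, 1, 3]})

print(check_id_columns(users, sessions, 'user_id', 'user_id'))
```

Repeated foreign key values are deliberately not flagged, since several children may reference the same parent.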

Score

(best) 1.0: The cardinality values are the same in the real and synthetic data

(worst) 0.0: The cardinality values are as different as can be

The example below shows a distribution of cardinality values for real and synthetic data (black and green, respectively). The CardinalityShapeSimilarity score is 0.85, indicating that the cardinalities are mostly similar with some key differences.

How does it work?

In a multi-table setup, there is a parent table and a child table. The parent contains a primary key that uniquely identifies every row, while the child contains a foreign key that refers to a parent row. Foreign keys may repeat, as multiple children can reference the same parent.

This metric computes the cardinality [1] of each parent row. That is, it counts the number of children that each parent row has, so that each parent row is associated with an integer ≥ 0.

This yields a numerical distribution for both the real and synthetic data. The CardinalityShapeSimilarity metric computes and returns the KSComplement score of these distributions.
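The steps above can be sketched for a single parent-child relationship. This is an illustrative approximation, not the library's implementation: it assumes KSComplement equals 1 minus the two-sample Kolmogorov-Smirnov statistic, computed here with `scipy.stats.ks_2samp`. The helper names and toy tables are hypothetical.

```python
import pandas as pd
from scipy.stats import ks_2samp


def cardinality_per_parent(parent, child, primary_key, foreign_key):
    """Count child rows per parent row, including parents with no children."""
    counts = child[foreign_key].value_counts()
    # Look up each parent ID in the counts; unmatched parents get 0.
    return parent[primary_key].map(counts).fillna(0).astype(int)


def cardinality_shape_similarity(real_cardinality, synthetic_cardinality):
    """Assumed KSComplement: 1 minus the two-sample KS statistic."""
    statistic, _ = ks_2samp(real_cardinality, synthetic_cardinality)
    return 1 - statistic


# Hypothetical toy data: four parents in each dataset.
real_users = pd.DataFrame({'user_id': [1, 2, 3, 4]})
real_sessions = pd.DataFrame({'user_id': [1, 1, 2, 3, 3, 3]})
synthetic_users = pd.DataFrame({'user_id': [1, 2, 3, 4]})
synthetic_sessions = pd.DataFrame({'user_id': [1, 2, 3, 4]})

real_cardinality = cardinality_per_parent(
    real_users, real_sessions, 'user_id', 'user_id')
synthetic_cardinality = cardinality_per_parent(
    synthetic_users, synthetic_sessions, 'user_id', 'user_id')

score = cardinality_shape_similarity(real_cardinality, synthetic_cardinality)
print(score)
```

In this toy data the real parents have 2, 1, 3 and 0 children while every synthetic parent has exactly 1, so the score falls well below 1.0. Note that parents with no children must be counted as 0; dropping them would distort the distribution.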

Usage

Access this metric from the multi_table module and use the compute_breakdown method.

from sdmetrics.multi_table import CardinalityShapeSimilarity

CardinalityShapeSimilarity.compute_breakdown(
    real_data={
      'users': real_user_table,
      'sessions': real_sessions_table,
      'transactions': real_transactions_table
    },
    synthetic_data={
      'users': synthetic_user_table,
      'sessions': synthetic_sessions_table,
      'transactions': synthetic_transactions_table
    },
    metadata=multi_table_metadata_dict
)

{
    ('users', 'sessions'): 0.78891,
    ('sessions', 'transactions'): 0.588211
}

Parameters

  • (required) real_data: A dictionary mapping table names to pandas.DataFrame objects that contain the real data

  • (required) synthetic_data: A dictionary mapping the same table names to pandas.DataFrame objects that contain the synthetic data

  • (required) metadata: A metadata dictionary describing the relationships between the tables (see Multi Table Metadata)

Returns: A dictionary that maps each relationship, given as a (parent table, child table) tuple, to its CardinalityShapeSimilarity score.
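Once the breakdown is available, it can be summarized with plain Python. The sketch below assumes the returned dictionary has the flat shape shown in the example output (relationship tuple mapped directly to a float score); if your installed version nests the score differently, adapt the lookup. The values are copied from the example output rather than computed.

```python
# Breakdown copied from the example output above (assumed flat shape).
breakdown = {
    ('users', 'sessions'): 0.78891,
    ('sessions', 'transactions'): 0.588211,
}

# Relationship whose synthetic cardinalities differ most from the real ones.
worst_relationship = min(breakdown, key=breakdown.get)

# Unweighted mean across relationships, as one possible overall summary.
average_score = sum(breakdown.values()) / len(breakdown)

print(worst_relationship, average_score)
```

An unweighted mean is only one way to aggregate; inspecting the lowest-scoring relationship is often more useful for deciding where the synthetic data needs work.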

References

[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
