CardinalityShapeSimilarity

For multi-table data with connected tables, this metric measures whether the cardinality of the parent table is the same in the real and synthetic datasets. The cardinality is defined as the number of child rows for each parent row.

Data Compatibility

  • ID: This metric is meant to be used on ID columns (primary and foreign keys). Primary key IDs must be unique, while foreign key IDs can repeat.

ID columns cannot have any missing values.
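
The requirements above can be checked before running the metric. The following is a minimal sketch using pandas; the helper name `check_id_columns`, the column names, and the toy tables are hypothetical and are not part of the SDMetrics API.

```python
import pandas as pd


def check_id_columns(parent, primary_key, child, foreign_key):
    """Return a list of problems found in the ID columns (empty if none).

    Hypothetical helper for illustration only; not part of SDMetrics.
    """
    problems = []
    # ID columns cannot contain missing values.
    if parent[primary_key].isna().any():
        problems.append(f"primary key '{primary_key}' has missing values")
    # Primary keys must uniquely identify each parent row.
    if parent[primary_key].duplicated().any():
        problems.append(f"primary key '{primary_key}' has duplicate values")
    # Foreign keys may repeat, but they must not be missing.
    if child[foreign_key].isna().any():
        problems.append(f"foreign key '{foreign_key}' has missing values")
    return problems


# Toy tables (assumed names and values).
users = pd.DataFrame({'user_id': ['User_00', 'User_01', 'User_02']})
sessions = pd.DataFrame({
    'session_id': ['S0', 'S1', 'S2'],
    'user_id': ['User_00', 'User_01', 'User_01'],  # repeated foreign key is fine
})

problems = check_id_columns(users, 'user_id', sessions, 'user_id')
print(problems)  # an empty list means the ID columns look compatible
```

An empty list suggests the tables satisfy the constraints listed above; any message points to the column that needs cleaning first.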

Score

(best) 1.0: The cardinality values are the same in the real and synthetic data

(worst) 0.0: The cardinality values are as different as can be
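
The two extremes above can be reproduced with a small sketch. It assumes the score is the KSComplement of the two cardinality distributions, that is, one minus the two-sample Kolmogorov-Smirnov statistic. The function names, column names, and toy tables are hypothetical; this illustrates the idea and is not the SDMetrics implementation.

```python
import numpy as np
import pandas as pd


def cardinalities(parent, primary_key, child, foreign_key):
    """Number of child rows per parent row (0 for parents without children)."""
    counts = child[foreign_key].value_counts()
    return parent[primary_key].map(counts).fillna(0).astype(int).to_numpy()


def ks_complement(real_values, synthetic_values):
    """One minus the largest gap between the two empirical CDFs."""
    real_values = np.sort(np.asarray(real_values))
    synthetic_values = np.sort(np.asarray(synthetic_values))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.union1d(real_values, synthetic_values)
    real_cdf = np.searchsorted(real_values, grid, side='right') / len(real_values)
    synthetic_cdf = np.searchsorted(synthetic_values, grid, side='right') / len(synthetic_values)
    return 1.0 - float(np.max(np.abs(real_cdf - synthetic_cdf)))


users = pd.DataFrame({'user_id': ['User_00', 'User_01', 'User_02']})

# Real data: cardinalities are 1, 2 and 0.
real_sessions = pd.DataFrame({'user_id': ['User_00', 'User_01', 'User_01']})
# Synthetic data with the same cardinality values, assigned to different parents.
similar_sessions = pd.DataFrame({'user_id': ['User_02', 'User_00', 'User_00']})
# Synthetic data where every parent has 3 children (no overlap with the real values).
different_sessions = pd.DataFrame({'user_id': ['User_00', 'User_01', 'User_02'] * 3})

real = cardinalities(users, 'user_id', real_sessions, 'user_id')
same_score = ks_complement(real, cardinalities(users, 'user_id', similar_sessions, 'user_id'))
different_score = ks_complement(real, cardinalities(users, 'user_id', different_sessions, 'user_id'))

print(same_score)       # 1.0
print(different_score)  # 0.0
```

Note that the score depends only on the distribution of cardinality values, not on which parent row has which value.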

As an example, consider a graph of the cardinality distributions for the real and synthetic data (black and green, respectively). In the real data, a vast majority of rows have a cardinality of 1. In the synthetic data, the cardinality is more evenly distributed in the [0, 3] range. The CardinalityShapeSimilarity score is 0.85, indicating that the cardinalities are mostly similar with some key differences.

How does it work?

In a multi table setup, there is a parent table and a child table. The parent contains a primary key that uniquely identifies every row, while the child contains a foreign key that refers to a parent row. The foreign keys may repeat, as multiple children can reference the same parent.

This metric computes the cardinality [1] of each parent row. That is, it computes the number of children that each parent row has, so that each parent row is associated with an integer ≥ 0. Each parent row can have a different number of children based on the references. For example, User_00 has 1 child row, User_01 has 2, User_02 has 0, and so on.

This yields a numerical distribution for both the real and synthetic data. The CardinalityShapeSimilarity metric computes and returns the KSComplement score of these distributions.

Usage

Access this metric from the multi_table module and use the compute_breakdown method.

from sdmetrics.multi_table import CardinalityShapeSimilarity

# Both dictionaries must use the same table names as keys
CardinalityShapeSimilarity.compute_breakdown(
    real_data={
      'users': real_user_table,
      'sessions': real_sessions_table,
      'transactions': real_transactions_table
    },
    synthetic_data={
      'users': synthetic_user_table,
      'sessions': synthetic_sessions_table,
      'transactions': synthetic_transactions_table
    },
    metadata=multi_table_metadata_dict
)

{
    ('users', 'sessions'): 0.78891,
    ('sessions', 'transactions'): 0.588211
}

Parameters

  • (required) real_data: A dictionary mapping table names to pandas.DataFrame objects that contain the real data

  • (required) synthetic_data: A dictionary mapping the same table names to pandas.DataFrame objects that contain the synthetic data

  • (required) metadata: A metadata dictionary describing the relationships between the tables (see Multi Table Metadata)

Returns: A dictionary that maps each relationship, given as a (parent table, child table) pair, to its CardinalityShapeSimilarity score.

References

[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)