Data Preparation

Multi table data is present in multiple tables that each have rows and columns. The tables are connected to each other through foreign and primary key references.

Before you begin creating synthetic data, it's important to have your data ready in the right format:

  1. Data, a dictionary that maps every table name to a pandas DataFrame object containing the actual data

  2. Metadata, a MultiTableMetadata object that describes your table. It includes the data types in each column, keys and the connections between tables.

Click to see the metadata
{
    "METADATA_SPEC_VERSION": "MULTI_TABLE_V1",
    "tables": {
        "guests": {
          "primary_key": "guest_email",
          "alternate_keys": ["credit_card_number"],
          "columns": {
            "guest_email": { "sdtype": "email", "pii": True },
            "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
            "has_rewards": { "sdtype": "boolean" },
            "room_type": { "sdtype": "categorical" },
            "amenities_fee": { "sdtype": "numerical" },
            "checkin_date": { "sdtype": "datetime", "datetime_format":  "%d %b %Y"},
            "checkout_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y"},
            "room_rate": { "sdtype": "numerical" },
            "billing_address": { "sdtype": "address", "pii": True},
            "credit_card_number": { "sdtype": "credit_card_number", "pii": True}
          }
        },
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "city": { "sdtype": "categorical" },
                "state": { "sdtype": "categorical" },
                "rating": { "sdtype": "numerical" },
                "classification": { "sdtype": "categorical" }
            }
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}

Learn More

Get started with a demo dataset or load your own data.

Create an object to describe the different columns in your data. Save it for future use.

Multi Table Schemas

What kinds of multi table schemas are compatible with the SDV? The SDV can be used to model many different types of multi table dataset schemas as long as they meet the criteria below.

  1. All the tables should be connected in some way. If you have disjoint sets of tables, you can model each set separately.

  2. There should be no cyclical dependencies. For eg, a table cannot refer to itself. Or if table A refers to table B, then table B cannot refer back to table A.

  3. There should be no missing references (aka orphan rows). If a table A refers to table B, then every reference must be found. Note that it is ok if a parent row has no children.

  4. The relationships should be one-to-many. SDV supports relationships between a parent primary key and a child foreign key. It does not support many-to-many or one-to-one relationships, though there are ways to workaround this for your schema.

Last updated

Copyright (c) 2023, DataCebo, Inc.