Data Preparation
Multi table data is present in multiple tables that each have rows and columns. The tables are connected to each other through foreign and primary key references.

This example of a multi table dataset has a table for users and a table for their sessions. Each user can have multiple sessions recorded.
Before you begin creating synthetic data, it's important to have your data ready in the right format:
- 1.Data, a dictionary that maps every table name to a pandas DataFrame object containing the actual data
- 2.Metadata, a MultiTableMetadata object that describes your table. It includes the data types in each column, keys and the connections between tables.
{
"METADATA_SPEC_VERSION": "MULTI_TABLE_V1",
"tables": {
"guests": {
"primary_key": "guest_email",
"alternate_keys": ["credit_card_number"],
"columns": {
"guest_email": { "sdtype": "email", "pii": True },
"hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
"has_rewards": { "sdtype": "boolean" },
"room_type": { "sdtype": "categorical" },
"amenities_fee": { "sdtype": "numerical" },
"checkin_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y"},
"checkout_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y"},
"room_rate": { "sdtype": "numerical" },
"billing_address": { "sdtype": "address", "pii": True},
"credit_card_number": { "sdtype": "credit_card_number", "pii": True}
}
},
"hotels": {
"primary_key": "hotel_id",
"columns": {
"hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
"city": { "sdtype": "categorical" },
"state": { "sdtype": "categorical" },
"rating": { "sdtype": "numerical" },
"classification": { "sdtype": "categorical" }
}
}
},
"relationships": [{
"parent_table_name": "hotels",
"parent_primary_key": "hotel_id",
"child_table_name": "guests",
"child_foreign_key": "hotel_id"
}]
}
What kinds of multi table schemas are compatible with the SDV? The SDV can be used to model many different types of multi table dataset schemas as long as they meet the criteria below.
- 1.All the tables should be connected in some way. If you have disjoint sets of tables, you can model each set separately.
- 2.There should be no cyclical dependencies. For eg, a table cannot refer to itself. Or if table A refers to table B, then table B cannot refer back to table A.
- 3.There should be no missing references (aka orphan rows). If a table A refers to table B, then every reference must be found. Note that it is ok if a parent row has no children.
Last modified 1mo ago