Data Preparation

Multi table data is spread across multiple tables, each with its own rows and columns. The tables are connected to each other through primary and foreign key references.

Before you begin creating synthetic data, it's important to have your data ready in the right format:

Data: a dictionary that maps every table name to a pandas DataFrame object containing the actual data
Metadata: a Metadata object that describes your tables. It includes the data type of each column, the keys, and the connections between tables.
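
As a minimal sketch of the expected data format, the snippet below builds the dictionary by hand from two small pandas DataFrames. The table and column names follow the guests and hotels example on this page; the row values are made up for illustration.

```python
import pandas as pd

# Parent table: one row per hotel (values are invented for illustration).
hotels = pd.DataFrame({
    "hotel_id": ["HID_000", "HID_001"],
    "city": ["Boston", "Austin"],
    "rating": [4.5, 3.8],
})

# Child table: each guest row points at a hotel through hotel_id.
guests = pd.DataFrame({
    "guest_email": ["a@example.com", "b@example.com", "c@example.com"],
    "hotel_id": ["HID_000", "HID_000", "HID_001"],
    "room_rate": [120.0, 135.5, 99.0],
})

# The data object: table name -> DataFrame.
data = {"hotels": hotels, "guests": guests}

for table_name, table in data.items():
    print(table_name, table.shape)
```

In practice each DataFrame would usually come from a file or database query rather than being typed in by hand.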

Example metadata (two tables, guests and hotels):

{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "guests": {
            "primary_key": "guest_email",
            "alternate_keys": ["credit_card_number"],
            "columns": {
                "guest_email": { "sdtype": "email", "pii": True },
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "has_rewards": { "sdtype": "boolean" },
                "room_type": { "sdtype": "categorical" },
                "amenities_fee": { "sdtype": "numerical" },
                "checkin_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y" },
                "checkout_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y" },
                "room_rate": { "sdtype": "numerical" },
                "billing_address": { "sdtype": "address", "pii": True },
                "credit_card_number": { "sdtype": "credit_card_number", "pii": True }
            }
        },
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "city": { "sdtype": "categorical" },
                "state": { "sdtype": "categorical" },
                "rating": { "sdtype": "numerical" },
                "classification": { "sdtype": "categorical" }
            }
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}
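
Because the metadata is a plain nested dictionary, its relationships can be cross-checked against the table definitions. The sketch below works on an abbreviated copy of the metadata above. `check_relationships` is a hypothetical helper written for this illustration, not part of the SDV API.

```python
# Abbreviated copy of the metadata above (only the columns needed here).
metadata = {
    "tables": {
        "guests": {
            "primary_key": "guest_email",
            "columns": {
                "guest_email": {"sdtype": "email", "pii": True},
                "hotel_id": {"sdtype": "id", "regex_format": "HID_[0-9]{3}"},
            },
        },
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": {"sdtype": "id", "regex_format": "HID_[0-9]{3}"},
                "city": {"sdtype": "categorical"},
            },
        },
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id",
    }],
}


def check_relationships(metadata):
    """Return a list of problems found in the relationships (hypothetical helper)."""
    problems = []
    tables = metadata["tables"]
    for rel in metadata["relationships"]:
        parent = tables.get(rel["parent_table_name"])
        child = tables.get(rel["child_table_name"])
        if parent is None or child is None:
            problems.append(f"unknown table in relationship: {rel}")
            continue
        # The parent side of a relationship must be the parent's primary key.
        if parent.get("primary_key") != rel["parent_primary_key"]:
            problems.append(
                f"{rel['parent_primary_key']} is not the primary key "
                f"of {rel['parent_table_name']}"
            )
        # The foreign key must be a declared column of the child table.
        if rel["child_foreign_key"] not in child["columns"]:
            problems.append(
                f"{rel['child_foreign_key']} is not a column "
                f"of {rel['child_table_name']}"
            )
    return problems


print(check_relationships(metadata))  # an empty list means no problems were found
```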

Learn More

Loading Data

Get started with a demo dataset or load your own data.

Creating Metadata

Create an object to describe the different columns in your data. Save it for future use.

Multi Table Schemas

What kinds of multi table schemas are compatible with the SDV? The SDV can be used to model many different types of multi table dataset schemas as long as they meet the criteria below.

There should be no cyclical dependencies. For example, a table cannot refer to itself, and if table A refers to table B, then table B cannot refer back to table A.
There should be no missing references (also known as orphan rows). If table A refers to table B, then every reference in table A must be found in table B. Note that it is ok if a parent row has no children.
The relationships should be one-to-many. SDV supports relationships between a parent primary key and a child foreign key. It does not support many-to-many or one-to-one relationships, though there are ways to work around this for your schema.
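
The first two criteria can be checked on your own data before modeling. The sketch below is one way to do so, assuming relationships are written with the same keys as the metadata on this page. `has_cycle` and `find_orphans` are hypothetical helpers, not SDV functions, and the sample rows are made up.

```python
import pandas as pd


def has_cycle(relationships):
    """Return True if the parent -> child graph has a cycle (self-references included)."""
    children = {}
    for rel in relationships:
        children.setdefault(rel["parent_table_name"], []).append(rel["child_table_name"])

    visiting, done = set(), set()

    def visit(table):
        if table in done:
            return False
        if table in visiting:
            return True  # reached a table that is still on the current path
        visiting.add(table)
        found = any(visit(child) for child in children.get(table, []))
        visiting.discard(table)
        done.add(table)
        return found

    return any(visit(table) for table in list(children))


def find_orphans(data, rel):
    """Return child rows whose foreign key has no matching parent primary key.

    Rows with a null foreign key are reported as well.
    """
    parent = data[rel["parent_table_name"]]
    child = data[rel["child_table_name"]]
    matched = child[rel["child_foreign_key"]].isin(parent[rel["parent_primary_key"]])
    return child[~matched]


relationships = [{
    "parent_table_name": "hotels",
    "parent_primary_key": "hotel_id",
    "child_table_name": "guests",
    "child_foreign_key": "hotel_id",
}]

data = {
    # HID_002 has no guests, which is allowed.
    "hotels": pd.DataFrame({"hotel_id": ["HID_000", "HID_001", "HID_002"]}),
    # HID_999 does not exist in hotels, so that guest row is an orphan.
    "guests": pd.DataFrame({
        "guest_email": ["a@example.com", "b@example.com", "c@example.com"],
        "hotel_id": ["HID_000", "HID_001", "HID_999"],
    }),
}

print("cycle found:", has_cycle(relationships))
orphans = find_orphans(data, relationships[0])
print(orphans)
```

Orphan rows found this way need to be fixed or dropped before the data meets the criteria above. Which of the two is appropriate depends on the dataset.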

Note that as of SDV 1.14.0, it is ok if your tables are not all connected to each other. This means it's ok to have separate, disconnected groups of tables within a single synthesizer.
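
To see whether a schema splits into disconnected groups, the tables can be treated as nodes of an undirected graph with one edge per relationship. The sketch below assumes that view. `table_groups` is a hypothetical helper, and the `products` and `suppliers` tables are made up to give a second group.

```python
def table_groups(table_names, relationships):
    """Group tables into connected components, ignoring relationship direction."""
    neighbors = {name: set() for name in table_names}
    for rel in relationships:
        parent, child = rel["parent_table_name"], rel["child_table_name"]
        neighbors[parent].add(child)
        neighbors[child].add(parent)

    groups, seen = [], set()
    for name in table_names:
        if name in seen:
            continue
        # Depth-first walk collecting every table reachable from this one.
        group, stack = set(), [name]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


tables = ["hotels", "guests", "products", "suppliers"]
relationships = [
    {"parent_table_name": "hotels", "child_table_name": "guests"},
    {"parent_table_name": "suppliers", "child_table_name": "products"},
]

print(table_groups(tables, relationships))
# [['guests', 'hotels'], ['products', 'suppliers']]
```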
