Synthetic Data Vault

Copyright (c) 2023, DataCebo, Inc.


Data Preparation


Last updated 16 days ago

Multi table data is present in multiple tables that each have rows and columns. The tables are connected to each other through foreign and primary key references.

Before you begin creating synthetic data, it's important to have your data ready in the right format:

  1. Data, a dictionary that maps every table name to a pandas DataFrame object containing the actual data

  2. Metadata, an object that describes your tables. It includes the data types in each column, the keys, and the connections between tables.

Example metadata:
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "guests": {
          "primary_key": "guest_email",
          "alternate_keys": ["credit_card_number"],
          "columns": {
            "guest_email": { "sdtype": "email", "pii": true },
            "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
            "has_rewards": { "sdtype": "boolean" },
            "room_type": { "sdtype": "categorical" },
            "amenities_fee": { "sdtype": "numerical" },
            "checkin_date": { "sdtype": "datetime", "datetime_format":  "%d %b %Y"},
            "checkout_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y"},
            "room_rate": { "sdtype": "numerical" },
            "billing_address": { "sdtype": "address", "pii": true},
            "credit_card_number": { "sdtype": "credit_card_number", "pii": true}
          }
        },
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "city": { "sdtype": "categorical" },
                "state": { "sdtype": "categorical" },
                "rating": { "sdtype": "numerical" },
                "classification": { "sdtype": "categorical" }
            }
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}
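As a rough sanity check before passing metadata like this to the SDV, the relationships can be compared against the table definitions. The sketch below uses only the Python standard library and a trimmed copy of the metadata above. The helper name `check_relationships` is hypothetical and is not part of the SDV API; the SDV performs its own validation, which this does not replace.

```python
import json

# Trimmed copy of the hotels/guests metadata shown above (only the keys used here).
METADATA_JSON = """
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "guests": {
            "primary_key": "guest_email",
            "columns": {
                "guest_email": {"sdtype": "email", "pii": true},
                "hotel_id": {"sdtype": "id", "regex_format": "HID_[0-9]{3}"}
            }
        },
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": {"sdtype": "id", "regex_format": "HID_[0-9]{3}"},
                "city": {"sdtype": "categorical"}
            }
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}
"""


def check_relationships(metadata):
    """Return a list of problems found in the relationships (hypothetical helper)."""
    problems = []
    tables = metadata["tables"]
    for rel in metadata["relationships"]:
        parent = rel["parent_table_name"]
        child = rel["child_table_name"]
        # Both ends of the relationship must be declared tables.
        if parent not in tables or child not in tables:
            problems.append(f"unknown table in relationship {parent} -> {child}")
            continue
        # The parent side must point at the parent's primary key.
        if tables[parent].get("primary_key") != rel["parent_primary_key"]:
            problems.append(f"{rel['parent_primary_key']} is not the primary key of {parent}")
        # The foreign key must be a declared column of the child table.
        if rel["child_foreign_key"] not in tables[child]["columns"]:
            problems.append(f"{rel['child_foreign_key']} is not a column of {child}")
    return problems


metadata = json.loads(METADATA_JSON)
problems = check_relationships(metadata)
print(problems)  # → []
```

An empty list means every relationship points at declared tables, a real primary key, and an existing foreign key column.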

Learn More

Multi Table Schemas

What kinds of multi table schemas are compatible with the SDV? The SDV can be used to model many different types of multi table dataset schemas as long as they meet the criteria below.

  1. There should be no cyclical dependencies. For example, a table cannot refer to itself. Similarly, if table A refers to table B, then table B cannot refer back to table A.

  2. There should be no missing references (aka orphan rows). If table A refers to table B, then every reference in table A must match an existing row in table B. Note that it is ok if a parent row has no children.

  3. The relationships should be one-to-many. SDV supports relationships between a parent primary key and a child foreign key. It does not support many-to-many or one-to-one relationships, though there are ways to work around this for your schema.

Note that as of SDV 1.14.0, it is ok if your tables are not all connected to each other. This means it's ok to have separate, disconnected groups of tables within a synthesizer.
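The schema criteria can be checked by hand before modeling. The following is a minimal sketch under stated assumptions: the function names (`has_cycle`, `find_orphans`, `has_unique_keys`) are hypothetical and not part of the SDV, and plain Python lists stand in for the key columns you would normally read from your pandas DataFrames.

```python
def has_cycle(relationships):
    """Return True if the parent -> child graph has a cycle, including self-references."""
    children = {}
    for rel in relationships:
        children.setdefault(rel["parent_table_name"], []).append(rel["child_table_name"])

    visiting, done = set(), set()

    def visit(table):
        if table in done:
            return False
        if table in visiting:
            # We came back to a table that is still being explored: a cycle.
            return True
        visiting.add(table)
        found = any(visit(child) for child in children.get(table, []))
        visiting.discard(table)
        done.add(table)
        return found

    return any(visit(table) for table in list(children))


def find_orphans(parent_keys, child_foreign_keys):
    """Return the foreign key values that match no parent primary key."""
    known = set(parent_keys)
    return sorted({fk for fk in child_foreign_keys if fk not in known})


def has_unique_keys(parent_keys):
    """Return True if no primary key value repeats (needed for one-to-many)."""
    return len(set(parent_keys)) == len(parent_keys)


# Illustrative values only; these are not taken from a real dataset.
relationships = [{
    "parent_table_name": "hotels",
    "parent_primary_key": "hotel_id",
    "child_table_name": "guests",
    "child_foreign_key": "hotel_id",
}]
hotel_ids = ["HID_000", "HID_001"]
guest_hotel_ids = ["HID_000", "HID_001", "HID_001", "HID_999"]

print(has_cycle(relationships))                  # → False
print(find_orphans(hotel_ids, guest_hotel_ids))  # → ['HID_999']
print(has_unique_keys(hotel_ids))                # → True
```

Here the guest row that refers to `HID_999` is an orphan, so it would need to be fixed or dropped before the data meets the second criterion.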

  • Loading Data: Get started with a demo dataset or load your own data.
  • Creating Metadata: Create an object to describe the different columns in your data. Save it for future use.
This example of a multi table dataset has a table for hotels and a table for their guests. Each hotel has multiple guests who have visited.