
Copyright (c) 2023, DataCebo, Inc.

Metadata JSON


Last updated 8 months ago

This guide describes the metadata JSON spec.


This is an example of a JSON file describing a multi table schema.

{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "city": { "sdtype": "categorical" },
                "rating": { "sdtype": "numerical" }
            },
            "column_relationships": []
        },
        "guests": {
            "primary_key": "guest_email",
            "columns": {
                "guest_email": { "sdtype": "email" },
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "checkin_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y" },
                "checkout_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y" },
                "room_type": { "sdtype": "categorical" }
            },
            "column_relationships": []
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}
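As a rough illustration of how such a file can be consumed, the sketch below parses a trimmed copy of the example with Python's standard json module and walks its top-level keys. It does not use the SDV library, and the variable names (metadata_json, table_columns, links) are illustrative only.

```python
import json

# A trimmed copy of the example, embedded as a string for illustration.
metadata_json = """
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "city": { "sdtype": "categorical" }
            }
        },
        "guests": {
            "primary_key": "guest_email",
            "columns": {
                "guest_email": { "sdtype": "email" },
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" }
            }
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}
"""

metadata = json.loads(metadata_json)

# Map each table name to the list of its column names.
table_columns = {
    name: list(table["columns"]) for name, table in metadata["tables"].items()
}

# Collect (parent, child) table pairs from the relationships list.
links = [
    (rel["parent_table_name"], rel["child_table_name"])
    for rel in metadata["relationships"]
]

print(table_columns)
print(links)
```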

Overview

The metadata contains the following top-level elements:

  • (required) "METADATA_SPEC_VERSION": The version of the metadata spec. For the format described in this guide, the value is "V1", which indicates a multi table dataset that is compatible with SDV version 1.

  • (required) "tables": A dictionary that maps the table names to the table-specific metadata such as primary keys, column names and data types.

  • (required) "relationships": A list of dictionaries that specify the connections between the tables.
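A minimal sketch of checking for these required elements, using plain Python dictionaries rather than the SDV library. The helper name missing_top_level_keys is hypothetical, and the check is based only on the list above.

```python
REQUIRED_TOP_LEVEL_KEYS = ("METADATA_SPEC_VERSION", "tables", "relationships")


def missing_top_level_keys(metadata: dict) -> list:
    """Return the required top-level keys that are absent from the metadata."""
    return [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in metadata]


complete = {"METADATA_SPEC_VERSION": "V1", "tables": {}, "relationships": []}
incomplete = {"METADATA_SPEC_VERSION": "V1", "tables": {}}

print(missing_top_level_keys(complete))
print(missing_top_level_keys(incomplete))
```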

Tables

The tables dictionary maps each table name to the table-specific metadata. This includes:

  • (required) "columns": A dictionary that maps the column names to the data types they represent and any other attributes.

  • "primary_key": The column name that is the primary key in the table

  • "alternate_keys": A list of column names that can act as alternate keys in the table
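The sketch below shows one way to sanity-check a single table entry. It assumes that every primary or alternate key must also be listed under "columns"; that rule is an assumption made for illustration, and the helper name check_table_keys is hypothetical.

```python
def check_table_keys(table: dict) -> list:
    """Return a list of problems found in one table's metadata entry."""
    columns = table.get("columns")
    if columns is None:
        return ['missing required "columns"']

    # Gather the primary key (if any) and the alternate keys (if any).
    keys = []
    if "primary_key" in table:
        keys.append(table["primary_key"])
    keys.extend(table.get("alternate_keys", []))

    # Assumption: every key column should also appear under "columns".
    return [
        f'key column "{key}" is not listed in "columns"'
        for key in keys
        if key not in columns
    ]


hotels = {
    "primary_key": "hotel_id",
    "columns": {"hotel_id": {"sdtype": "id"}, "city": {"sdtype": "categorical"}},
}
broken = {
    "primary_key": "hotel_code",
    "columns": {"hotel_id": {"sdtype": "id"}},
}

print(check_table_keys(hotels))
print(check_table_keys(broken))
```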

Table Columns

When describing a column, you will provide the column name and the data type, known as the sdtype.

The 5 common sdtypes are: "numerical", "datetime", "categorical", "boolean" and "id". Each type is described below, along with how to specify it in the metadata.

Boolean columns represent True or False values.

"is_active" : {
    "sdtype": "boolean"
}

Properties (None)

Categorical columns represent discrete data. By default, they are unordered (aka nominal data).

"room_type" : {
    "sdtype": "categorical"
}

Properties (None)

Datetime columns represent a point in time.

"checkin_date": {
    "sdtype": "datetime", 
    "datetime_format": "%d %b %Y"
}

Properties
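To see what the "%d %b %Y" format from the example means in practice, the sketch below parses and re-formats a date with Python's standard datetime module. The sample value "27 Mar 2021" is made up for illustration, and month abbreviations such as "Mar" assume an English locale.

```python
from datetime import datetime

datetime_format = "%d %b %Y"  # day of month, abbreviated month name, 4-digit year

# Parse a string that follows the format into a datetime object.
parsed = datetime.strptime("27 Mar 2021", datetime_format)

# Format the datetime back into a string using the same format.
round_trip = parsed.strftime(datetime_format)

print(parsed.date())
print(round_trip)
```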

Numerical columns represent discrete or continuous numerical values.

"rating": {
    "sdtype": "numerical",
    "computer_representation": "Float"
}

Properties

  • computer_representation: A string that represents how you'll ultimately store the data. This determines the min and max values allowed. Available options are: 'Float', 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'
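For the integer options, the min and max values can be worked out from the bit width. The sketch below assumes the names follow the usual fixed-width signed and unsigned integer conventions; the helper integer_bounds is hypothetical and does not cover 'Float'.

```python
def integer_bounds(representation: str) -> tuple:
    """Return (min, max) for an integer representation such as 'Int8' or 'UInt16'."""
    if representation.startswith("UInt"):
        bits = int(representation[len("UInt"):])
        return 0, 2**bits - 1
    if representation.startswith("Int"):
        bits = int(representation[len("Int"):])
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"Not an integer representation: {representation!r}")


for name in ("Int8", "UInt8", "Int16", "UInt16"):
    print(name, integer_bounds(name))
```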

ID columns represent identifiers that do not have any special mathematical or semantic meaning.

"hotel_id": { 
    "sdtype": "id",
    "regex_format": "HID_[0-9]{3}"
}

Properties
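The pattern "HID_[0-9]{3}" from the example can be tried out with Python's standard re module. This sketch assumes the pattern must match the entire ID value, which is why it uses fullmatch; the candidate values are made up for illustration.

```python
import re

regex_format = "HID_[0-9]{3}"  # the literal prefix "HID_" followed by exactly 3 digits
pattern = re.compile(regex_format)

candidates = ["HID_042", "HID_42", "hid_123", "HID_1234"]

# Assumption: the whole value must match the pattern, not just a prefix.
matches = [value for value in candidates if pattern.fullmatch(value)]

print(matches)
```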

"guest_email": {
    "sdtype": "email",
    "pii": true
}

Properties

  • pii: A boolean denoting whether the data is sensitive

    • (default) true: The column is sensitive, meaning the values should be anonymized

    • false: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data

Column Relationships

Annotate groups of columns that represent higher level concepts. Denote the concept using the "type" keyword, followed by "column_names" with the list of columns involved. The column names can be present in any order.

Each relationship type supports different types of columns. Browse the table below to explore different options.

An address is defined by 2 or more columns that have the following sdtypes: country_code, administrative_unit, state, state_abbr, city, postcode, street_address and secondary_address.

{
    "type": "address",
    "column_names": ["addr_line1", "addr_line2", "city", "state", "zipcode"]
}

A GPS coordinate pair is defined by 2 columns: one with sdtype latitude and one with sdtype longitude.

{
    "type": "gps",
    "column_names": ["location_lat", "location_lon"]
}
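A minimal sketch of checking a GPS column relationship against a table's columns, based only on the rule stated above (exactly 2 columns, one latitude and one longitude). The helper name check_gps_relationship and the sample columns are hypothetical.

```python
def check_gps_relationship(relationship: dict, columns: dict) -> bool:
    """Return True if the relationship names one latitude and one longitude column."""
    names = relationship["column_names"]
    # Look up each column's sdtype; unknown columns get an empty sdtype.
    sdtypes = sorted(columns.get(name, {}).get("sdtype", "") for name in names)
    return len(names) == 2 and sdtypes == ["latitude", "longitude"]


columns = {
    "location_lat": {"sdtype": "latitude"},
    "location_lon": {"sdtype": "longitude"},
    "city": {"sdtype": "city"},
}

# The column names can be present in any order.
valid = {"type": "gps", "column_names": ["location_lon", "location_lat"]}
invalid = {"type": "gps", "column_names": ["location_lat", "city"]}

print(check_gps_relationship(valid, columns))
print(check_gps_relationship(invalid, columns))
```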

Additional column relationships coming soon!

Relationships

A list of dictionary objects, each describing the relationship between 2 connected tables: a parent and a child. The parent table contains the primary key that is referenced, while the child table has rows that refer to its parent. Multiple child rows can refer to the same parent row.

  • "parent_table_name": The name of the parent table

  • "parent_primary_key": The primary key column in the parent table. This column uniquely identifies each row in the parent table.

  • "child_table_name": The name of the child table that refers to the parent

  • "child_foreign_key": The foreign key column in the child table. The values in this column contain a reference to a row in the parent table

Use multiple dictionaries to represent multiple relationships.

"relationships": [{
    "parent_table_name": "users",
    "parent_primary_key": "user_id",
    "child_table_name": "sessions",
    "child_foreign_key": "user_id"
}, {
    "parent_table_name": "sessions",
    "parent_primary_key": "session_id",
    "child_table_name": "transaction",
    "child_foreign_key": "transacted_session_id"
}]
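The sketch below cross-checks a relationships list against the tables dictionary using plain Python. It is not the SDV's own validation: the helper name check_relationships is hypothetical, and the rules it applies (tables must exist, the parent key must be the parent's primary key, the foreign key must be a child column) are taken from the descriptions above.

```python
def check_relationships(metadata: dict) -> list:
    """Return a list of problems found in the metadata's relationships."""
    problems = []
    tables = metadata.get("tables", {})
    for rel in metadata.get("relationships", []):
        parent = tables.get(rel["parent_table_name"])
        child = tables.get(rel["child_table_name"])
        if parent is None:
            problems.append(f'unknown parent table "{rel["parent_table_name"]}"')
        elif parent.get("primary_key") != rel["parent_primary_key"]:
            problems.append(
                f'"{rel["parent_primary_key"]}" is not the primary key '
                f'of "{rel["parent_table_name"]}"'
            )
        if child is None:
            problems.append(f'unknown child table "{rel["child_table_name"]}"')
        elif rel["child_foreign_key"] not in child.get("columns", {}):
            problems.append(
                f'"{rel["child_foreign_key"]}" is not a column '
                f'of "{rel["child_table_name"]}"'
            )
    return problems


metadata = {
    "tables": {
        "users": {"primary_key": "user_id", "columns": {"user_id": {"sdtype": "id"}}},
        "sessions": {
            "primary_key": "session_id",
            "columns": {"session_id": {"sdtype": "id"}, "user_id": {"sdtype": "id"}},
        },
    },
    "relationships": [
        {
            "parent_table_name": "users",
            "parent_primary_key": "user_id",
            "child_table_name": "sessions",
            "child_foreign_key": "user_id",
        },
        {
            "parent_table_name": "sessions",
            "parent_primary_key": "session_id",
            "child_table_name": "transaction",
            "child_foreign_key": "transacted_session_id",
        },
    ],
}

# The "transaction" table is deliberately left out to show a reported problem.
print(check_relationships(metadata))
```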

Multi Sequence Data

In some cases, you may have a table that describes multiple, ordered sequences of data such as the one shown below.

You can annotate the sequences within a table by applying the sequence_key and sequence_index keywords.

{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "patients": {
            "sequence_key": "Patient ID",
            "sequence_index": "Time",
            "columns": {
                "Patient ID": { "sdtype": "id", "regex_format": "ID_[0-9]{3}" },
                "Address": { "sdtype": "physical_address", "pii": true },
                "Smoker": { "sdtype": "boolean" },
                "Time": { "sdtype": "datetime", "datetime_format": "%m/%d/%Y" },
                "Heart Rate": { "sdtype": "categorical" },
                "Systolic BP": { "sdtype": "numerical" }
            }
        }
    }
}

"sequence_key": A column name of the sequence key, if you have multi-sequence data

The sequence key is a column that identifies which rows belong to which sequence. This is usually an ID column but it may also be a PII sdtype (such as "phone_number").

This is important for tables that contain multiple sequences. In our example, the sequence key is 'Patient ID' because this column is used to break up the sequences.

If you don't supply a sequence key, the SDV assumes that your table only contains a single sequence. Note: The SDV sequential models do not fully support single sequence data.

"sequence_index": A column name of the sequence index, if you have sequential data

The sequence index determines the spacing between the rows in a sequence. Use this if you have an explicit index such as a timestamp. If you don't supply a sequence index, the SDV assumes there is equal spacing of an unknown unit.
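To make the two roles concrete, the sketch below splits a small table into sequences by the sequence key and orders each sequence by the sequence index. The rows are made-up sample data, and this is only an illustration of the concept, not how the SDV processes sequences internally.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows from a table like the "patients" example.
rows = [
    {"Patient ID": "ID_001", "Time": "01/03/2020", "Systolic BP": 121},
    {"Patient ID": "ID_002", "Time": "01/01/2020", "Systolic BP": 110},
    {"Patient ID": "ID_001", "Time": "01/01/2020", "Systolic BP": 118},
]

sequence_key = "Patient ID"
sequence_index = "Time"
datetime_format = "%m/%d/%Y"

# The sequence key decides which sequence each row belongs to.
sequences = defaultdict(list)
for row in rows:
    sequences[row[sequence_key]].append(row)

# The sequence index decides the order of the rows within each sequence.
for sequence in sequences.values():
    sequence.sort(key=lambda r: datetime.strptime(r[sequence_index], datetime_format))

for key, sequence in sequences.items():
    print(key, [r["Systolic BP"] for r in sequence])
```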

Create your metadata programmatically. Use the Python API to automatically detect the metadata based on your data.

Datetime columns: (required) datetime_format: A string describing the format as defined by Python's strftime format codes.

ID columns: regex_format: A string describing the format of the ID as a regular expression.

Other sdtypes: You can input any other data type such as 'phone_number', 'ssn' or 'email'. See the Sdtypes Reference for a full list.

Do you have a request for a type of column relationship? Please file a feature request describing your use case.

While anyone can add column relationships to their data, SDV Enterprise users will see the highest quality data for the relationships. To learn more about SDV Enterprise and its extra features, visit our website.

An example of multi table data. The tables are connected to each other through primary/foreign keys.

An example of sequential data. There are multiple sequences (one for each Patient ID). Within each sequence is an ordered set of rows.