Synthetic Data Vault

Copyright (c) 2023, DataCebo, Inc.


Metadata


Last updated 7 months ago

Metadata is a description of the dataset you want to synthesize. The dataset may consist of one or multiple tables. Metadata includes the table names, the column names, the data type of each column, and the relationships between tables.

For example, your data may contain two connected tables, hotels and guests. The metadata description for this dataset is shown below:
{
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "city": { "sdtype": "categorical" },
                "rating": { "sdtype": "numerical" }
            },
            "column_relationships": []
        },
        "guests": {
            "primary_key": "guest_email",
            "columns": {
                "guest_email": { "sdtype": "email" },
                "hotel_id": { "sdtype": "id", "regex_format": "HID_[0-9]{3}" },
                "checkin_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y" },
                "checkout_date": { "sdtype": "datetime", "datetime_format": "%d %b %Y" },
                "room_type": { "sdtype": "categorical" }
            },
            "column_relationships": []
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}
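The relationship entries in the metadata above have to agree with the table definitions: the parent key must be the parent table's primary key, and the foreign key must be a column of the child table. The following is a minimal sketch of that consistency check using only the Python standard library. The helper name `find_relationship_problems` is hypothetical and is not part of SDV, and the JSON is a trimmed copy of the example above.

```python
import json

# A trimmed copy of the metadata shown above (only the parts the check needs).
METADATA_JSON = """
{
    "tables": {
        "hotels": {
            "primary_key": "hotel_id",
            "columns": {
                "hotel_id": {"sdtype": "id"},
                "city": {"sdtype": "categorical"}
            }
        },
        "guests": {
            "primary_key": "guest_email",
            "columns": {
                "guest_email": {"sdtype": "email"},
                "hotel_id": {"sdtype": "id"}
            }
        }
    },
    "relationships": [{
        "parent_table_name": "hotels",
        "parent_primary_key": "hotel_id",
        "child_table_name": "guests",
        "child_foreign_key": "hotel_id"
    }]
}
"""


def find_relationship_problems(metadata):
    """Hypothetical helper (not an SDV function): list relationship inconsistencies."""
    problems = []
    tables = metadata.get("tables", {})
    for rel in metadata.get("relationships", []):
        parent = tables.get(rel["parent_table_name"])
        child = tables.get(rel["child_table_name"])
        if parent is None or child is None:
            problems.append(f"unknown table in relationship: {rel}")
            continue
        # The parent side must point at the parent table's primary key.
        if parent.get("primary_key") != rel["parent_primary_key"]:
            problems.append(
                f"{rel['parent_primary_key']} is not the primary key of "
                f"{rel['parent_table_name']}"
            )
        # The foreign key must be a column of the child table.
        if rel["child_foreign_key"] not in child.get("columns", {}):
            problems.append(
                f"{rel['child_foreign_key']} is not a column of "
                f"{rel['child_table_name']}"
            )
    return problems


metadata = json.loads(METADATA_JSON)
problems = find_relationship_problems(metadata)
print(problems or "relationships are consistent")
```

This only illustrates how the pieces of the description fit together. For real projects, rely on SDV's own validation rather than a hand-written check.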

SDV treats metadata as the ground truth

SDV synthesizers refer to the metadata as the ground truth whenever they create or evaluate synthetic data. For high-quality synthetic data generation, it's vital that your metadata accurately describes the data.

See metadata in action! Your data may have a column containing values such as 94102, 94130 and 94702. What should SDV learn about this data in order to create synthetic data?

If the metadata specifies the data is numerical, SDV learns these values on a sliding number scale. But if the metadata specifies the data actually represents concepts such as postal codes, SDV can better ensure your synthetic data will be valid.
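To make the difference concrete, here is a small standard-library sketch. It is an illustration only and does not reproduce how SDV models data internally. The numerical treatment is approximated by drawing any integer between the smallest and largest observed value, and the concept treatment is approximated by drawing from a set of known-valid codes.

```python
import random

# The example values from the text above.
observed = [94102, 94130, 94702]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Treated as numbers: any integer on the scale between min and max is fair game,
# even though most of those integers may not be real postal codes.
numeric_samples = [rng.randint(min(observed), max(observed)) for _ in range(5)]

# Treated as a concept: only values known to be valid codes are produced.
# (Assumption: the observed values stand in for a list of valid postal codes.)
concept_samples = [rng.choice(observed) for _ in range(5)]

print("numerical treatment:", numeric_samples)
print("concept treatment:  ", concept_samples)
```

Under the numerical treatment nothing guarantees that a sampled value is an existing postal code, while under the concept treatment every sampled value comes from the valid set.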

Auto-detect and validate your metadata

SDV allows you to auto-detect metadata based on your data. Please spend some time inspecting and updating your metadata to ensure it accurately describes your data.

from sdv.metadata import Metadata

# 1. auto-detect metadata based on your data
# (hotels_dataframe and guests_dataframe are pandas DataFrames you have already loaded)
metadata = Metadata.detect_from_dataframes(
    data={
        'hotels': hotels_dataframe,
        'guests': guests_dataframe
    })

# 2. carefully inspect and update your metadata
metadata.visualize()
metadata.update_column(
    column_name='room_type',
    sdtype='categorical',
    table_name='guests'
)

metadata.validate()

# 3. when you're done, save it to a file for future use
metadata.save_to_json('my_final_metadata.json')
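When inspecting metadata by hand, it can help to spot-check a few rows against the formats declared for each column. The sketch below does this with only the standard library, using the `regex_format` and `datetime_format` entries from the example metadata. The helper `value_matches` and the sample rows are hypothetical, and this is not how SDV's own validation is implemented.

```python
import re
from datetime import datetime

# Column descriptions copied from the "guests" table in the example metadata.
guests_columns = {
    "hotel_id": {"sdtype": "id", "regex_format": "HID_[0-9]{3}"},
    "checkin_date": {"sdtype": "datetime", "datetime_format": "%d %b %Y"},
}

# Made-up sample rows: the second row deliberately violates both formats.
rows = [
    {"hotel_id": "HID_000", "checkin_date": "27 Dec 2020"},
    {"hotel_id": "HID_01", "checkin_date": "2020-12-30"},
]


def value_matches(column_meta, value):
    """Hypothetical helper: does a value fit the format declared in the metadata?"""
    if "regex_format" in column_meta:
        return re.fullmatch(column_meta["regex_format"], value) is not None
    if "datetime_format" in column_meta:
        try:
            datetime.strptime(value, column_meta["datetime_format"])
            return True
        except ValueError:
            return False
    return True  # no declared format to check against


# Collect (row index, column name, offending value) for every mismatch.
mismatches = [
    (index, name, row[name])
    for index, row in enumerate(rows)
    for name, meta in guests_columns.items()
    if not value_matches(meta, row[name])
]

for index, name, value in mismatches:
    print(f"row {index}: {name}={value!r} does not match the metadata")
```

A mismatch can mean either that the data needs cleaning or that the metadata needs updating. Deciding which one applies is part of the inspection step.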

Resources

Click the links below to get started with metadata.

  • Sdtypes: Learn SDV's system for keeping track of data types for each column.
  • Metadata API: Use Python to auto-detect, inspect, and update metadata.
  • Metadata JSON: This is the final format that your metadata will be saved as.
Figure: An example of multi-table data. The tables are connected to each other through primary/foreign keys.