Metadata
This guide describes the SDGym's metadata specification for a single table of data.
Metadata is a basic, factual description of a dataset that includes:
- The type of data that each column represents
- The primary keys and other identifiers of the table
How does the SDGym library use metadata?
Many of the synthesizers use information in the metadata to create higher quality synthetic data. For example, the SDV Synthesizers apply different logic to different column types.
Additionally, the evaluation framework factors in the metadata when applying metrics. For example, some metrics may only be applicable for specific column types.
The SDGym library expects that every dataset will have corresponding metadata provided as a JSON file. During benchmarking, the SDGym reads the file as a Python dictionary.
We assume that the data is present in a CSV format that describes rows and columns of a single table.

This example of a single table includes a new row for each user. The row includes their personal information.
This is the metadata dictionary for the illustrated table.
{
    "primary_key": "user_id",
    "fields": {
        "user_id": {
            "type": "id",
            "subtype": "string",
            "regex": "U_[0-9]{3}"
        },
        "age": {
            "type": "numerical",
            "subtype": "integer"
        },
        "address": {
            "type": "categorical",
            "pii": true
        },
        "tier": {
            "type": "categorical"
        },
        "active": {
            "type": "boolean"
        },
        "paid_amt": {
            "type": "numerical",
            "subtype": "float"
        },
        "renew_date": {
            "type": "datetime",
            "format": "%Y-%m-%d"
        }
    }
}
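Since the metadata is stored as a JSON file and read as a Python dictionary, the round trip can be sketched with the standard `json` module. This is only an illustration: the file name `metadata.json` and the helper `load_metadata` are hypothetical, and SDGym's own loading code may differ.

```python
import json
import tempfile
from pathlib import Path

# A shortened copy of the metadata, used only for illustration.
example_metadata = {
    "primary_key": "user_id",
    "fields": {
        "user_id": {"type": "id", "subtype": "string", "regex": "U_[0-9]{3}"},
        "age": {"type": "numerical", "subtype": "integer"},
    },
}


def load_metadata(path):
    """Read a metadata JSON file and return it as a Python dictionary.

    Hypothetical helper, not part of the SDGym API.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Write the example to a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = Path(tmp_dir) / "metadata.json"  # hypothetical file name
    metadata_path.write_text(json.dumps(example_metadata, indent=4), encoding="utf-8")
    metadata = load_metadata(metadata_path)

print(metadata["primary_key"])
print(sorted(metadata["fields"]))
```

Note that JSON `true` and `false` become Python `True` and `False` after loading.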
The metadata has two keys:
- "primary_key": the column name used to identify a row in the table
- (required) "fields": a dictionary description of each column
{
    "primary_key": "user_id",
    "fields": { <column information> }
}
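A quick structural check of these two keys might look like the sketch below. The function `check_metadata_keys` is hypothetical and is not SDGym's validation logic. It also assumes that a primary key, when given, should name a column described in "fields", which the specification above implies but does not state outright.

```python
def check_metadata_keys(metadata):
    """Return a list of problems found in the top-level metadata structure.

    Hypothetical helper for illustration; not part of the SDGym API.
    """
    problems = []

    # "fields" is required and must map column names to descriptions.
    fields = metadata.get("fields")
    if not isinstance(fields, dict):
        problems.append('"fields" must be a dictionary of column descriptions')
        fields = {}

    # Assumption: the primary key, if present, refers to a described column.
    primary_key = metadata.get("primary_key")
    if primary_key is not None and primary_key not in fields:
        problems.append(f'primary key "{primary_key}" is not described in "fields"')

    return problems


good = {"primary_key": "user_id", "fields": {"user_id": {"type": "id", "subtype": "string"}}}
bad = {"primary_key": "user_id", "fields": {"age": {"type": "numerical", "subtype": "integer"}}}

print(check_metadata_keys(good))
print(check_metadata_keys(bad))
```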
The "fields" key describes each column. It contains the name of the column, followed by the type of data and any other information about it. There are specific data types to choose from:
- categorical
- datetime
- numerical
- boolean
- id
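Because every column declares one of these five types, code that consumes the metadata can group columns by type before deciding how to treat them. The following is a minimal sketch, assuming the "fields" dictionary has already been loaded into Python; `columns_by_type` and `VALID_TYPES` are hypothetical names, not SDGym functions.

```python
from collections import defaultdict

# The five data types listed in this guide.
VALID_TYPES = {"categorical", "datetime", "numerical", "boolean", "id"}


def columns_by_type(fields):
    """Group column names by their declared "type" (hypothetical helper)."""
    grouped = defaultdict(list)
    for column_name, info in fields.items():
        column_type = info.get("type")
        if column_type not in VALID_TYPES:
            raise ValueError(f"column {column_name!r} has unknown type {column_type!r}")
        grouped[column_type].append(column_name)
    return dict(grouped)


fields = {
    "user_id": {"type": "id", "subtype": "string", "regex": "U_[0-9]{3}"},
    "age": {"type": "numerical", "subtype": "integer"},
    "address": {"type": "categorical", "pii": True},
    "tier": {"type": "categorical"},
    "active": {"type": "boolean"},
    "paid_amt": {"type": "numerical", "subtype": "float"},
    "renew_date": {"type": "datetime", "format": "%Y-%m-%d"},
}

grouped = columns_by_type(fields)
print(grouped["numerical"])
```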
Use "type": "categorical" to represent data that has discrete categories.
"tier": {
    "type": "categorical"
}
Properties
- "pii": true or false to represent whether the data is sensitive, meaning it should be anonymized in the synthetic data. By default, we assume the data is not sensitive.
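The default described above (a missing "pii" key means the column is not sensitive) can be expressed with `dict.get`. This is a small sketch with a hypothetical helper name, `sensitive_columns`; how a synthesizer then anonymizes those columns is outside its scope.

```python
def sensitive_columns(fields):
    """List columns flagged with "pii": true (hypothetical helper).

    A missing "pii" key is treated as False, matching the stated default.
    """
    return [name for name, info in fields.items() if info.get("pii", False)]


fields = {
    "address": {"type": "categorical", "pii": True},
    "tier": {"type": "categorical"},
}

pii_columns = sensitive_columns(fields)
print(pii_columns)
```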
Use "type": "datetime" to specify columns that represent points in time.
"renew_date": {
    "type": "datetime",
    "format": "%Y-%m-%d"
}
Properties
- (required) "format" to describe the format of the datetime string
The format string has special values to describe the components. For example, Jan 06, 2022 is represented as "%b %d, %Y".
- Year: "%Y" for a 4-digit year like 2022, or "%y" for a 2-digit year like 22
- Month: "%m" for a 2-digit month like 01, "%b" for an abbreviated month like Jan
- Day: "%d" for a 2-digit day like 06
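The directives listed above appear to follow the same conventions as Python's `strptime` and `strftime` codes, so a format string can be tried out with the standard `datetime` module before it goes into the metadata. The sketch below assumes an English (or default C) locale, since "%b" month abbreviations are locale-dependent.

```python
from datetime import datetime

# Parse the example date from this guide using its format string.
parsed = datetime.strptime("Jan 06, 2022", "%b %d, %Y")

# Re-render the same date with the format used by the "renew_date" column.
iso_text = parsed.strftime("%Y-%m-%d")

# "%y" gives the 2-digit year instead of the 4-digit one.
short_year = parsed.strftime("%y")

print(parsed.year, parsed.month, parsed.day)
print(iso_text)
print(short_year)
```

If `strptime` raises a `ValueError`, the format string does not match the data, which is worth catching before benchmarking.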
Use "type": "numerical" to specify columns that represent whole numbers or continuous values.
"age": {
    "type": "numerical",
    "subtype": "integer"
},
"paid_amt": {
    "type": "numerical",
    "subtype": "float"
}
Properties
- (required) "subtype": "float" or "integer" to specify whether this is a continuous value or a whole number
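One way to picture the role of "subtype" is to convert raw CSV text into the matching Python number. This is a hedged sketch only: `parse_numerical` is a hypothetical helper, and SDGym may handle numerical columns differently, for example through pandas dtypes.

```python
def parse_numerical(raw_value, subtype):
    """Convert a raw CSV string according to the declared numerical subtype.

    Hypothetical helper for illustration; not part of the SDGym API.
    """
    if subtype == "integer":
        return int(raw_value)  # whole numbers, e.g. an age
    if subtype == "float":
        return float(raw_value)  # continuous values, e.g. an amount paid
    raise ValueError(f"unknown numerical subtype: {subtype!r}")


age = parse_numerical("42", "integer")
paid_amt = parse_numerical("19.99", "float")
print(age, paid_amt)
```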
Use "type": "boolean" to represent columns that have True/False values.
"active": {
    "type": "boolean"
}
Properties: None
Use "type": "id" to represent any columns that act as row identifiers for the table. In a single table, the ID column is typically the primary key.
"user_id": {
    "type": "id",
    "subtype": "string",
    "regex": "U_[0-9]{3}"
}
Properties
- (required) "subtype": either "integer" or "string"
- "regex": a regular expression describing how to create the id, if the id is a string
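To see what a "regex" such as "U_[0-9]{3}" permits, a candidate ID can be tested against it with Python's `re` module. The helper `matches_id_regex` below is hypothetical. It assumes the whole ID must match the pattern, and it treats a missing "regex" as "no constraint".

```python
import re


def matches_id_regex(value, field_info):
    """Check whether a string ID fully matches the field's "regex", if any.

    Hypothetical helper for illustration; not part of the SDGym API.
    """
    pattern = field_info.get("regex")
    if pattern is None:
        return True  # assumption: no regex means no format constraint
    return re.fullmatch(pattern, value) is not None


user_id_field = {"type": "id", "subtype": "string", "regex": "U_[0-9]{3}"}

print(matches_id_regex("U_001", user_id_field))
print(matches_id_regex("U_12", user_id_field))
```

Here `re.fullmatch` is used rather than `re.match` so that trailing characters, as in "U_0012", are rejected.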
Yes, every dataset available in the SDGym's demo module has an associated metadata file. If you are supplying custom datasets, make sure to write and attach a metadata file too. See Datasets for more information.
Yes, your custom synthesizer can use any information in the metadata to help create higher quality synthetic data. The metadata information is passed into your synthesizer as a Python dictionary during the training process. See Custom Synthesizers for more information.