Metadata
This guide describes the SDGym's metadata specification for a single table of data.
What is metadata?
Metadata is a basic, factual description of a dataset that includes:
The type of data that each column represents
The primary keys and other identifiers of the table
The SDGym library expects that every dataset will have corresponding metadata provided as a JSON file. During benchmarking, the SDGym reads the file as a Python dictionary.
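As a rough illustration of what "reads the file as a Python dictionary" means, the sketch below loads a metadata JSON file with Python's standard json module. It does not show SDGym's own loading code; the function name load_metadata, the table name my_table and the temporary file are assumptions made for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def load_metadata(path):
    """Read a metadata JSON file and return it as a plain Python dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical metadata, written to a temporary file for the demonstration.
example = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {"my_table": {"columns": {"active": {"sdtype": "boolean"}}}},
}
metadata_path = Path(tempfile.mkdtemp()) / "metadata.json"
metadata_path.write_text(json.dumps(example, indent=2), encoding="utf-8")

# The loaded object is an ordinary dict with the same keys as the JSON file.
metadata = load_metadata(metadata_path)
```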
Example
We assume that the data is stored in CSV format and describes the rows and columns of a single table.

Overview
The metadata for a single table contains the following elements:
(required)
"METADATA_SPEC_VERSION": The version of the metadata. If you are using this, the metadata version will be "V1", indicating that it is a multi-table dataset that is compatible with SDV version 1.
(required)
"tables": A dictionary that maps the table names to the table-specific metadata such as primary keys, column names and data types. Note that SDGym only works with single-table schemas.
Tables
The tables dictionary maps each table name to the table-specific metadata. Because SDGym only works with single-table schemas, the table name does not matter. But please be sure that the table-specific metadata matches your data.
(required)
"columns": A dictionary that maps the column names to the data types they represent and any other attributes.
"primary_key": The column name that is the primary key in the table.
"alternate_keys": A list of column names that can act as alternate keys in the table.
Table Columns
When describing a column, you will provide the column name and the data type, known as the sdtype.
The 5 common sdtypes are: "numerical", "datetime", "categorical", "boolean" and "id". Each type, and how to specify it in the metadata, is described below.
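To make the overall shape concrete, here is a minimal sketch of a complete metadata dictionary in Python, together with a small consistency check. The table name customers and the helper find_problems are hypothetical and not part of SDGym; the checks only reflect the requirements stated in this guide (the two top-level keys, a single table, a columns dictionary, and an sdtype for each column).

```python
metadata = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "customers": {  # hypothetical table name; the name itself does not matter
            "primary_key": "user_id",
            "columns": {
                "user_id": {"sdtype": "id", "regex_format": "U_[0-9]{3}"},
                "age": {"sdtype": "numerical", "computer_representation": "Int64"},
                "tier": {"sdtype": "categorical"},
                "active": {"sdtype": "boolean"},
                "renew_date": {"sdtype": "datetime", "datetime_format": "%Y-%m-%d"},
            },
        }
    },
}


def find_problems(metadata):
    """Return a list of human-readable problems found in a metadata dictionary."""
    problems = []
    for key in ("METADATA_SPEC_VERSION", "tables"):
        if key not in metadata:
            problems.append(f"missing required key: {key}")
    tables = metadata.get("tables", {})
    if len(tables) != 1:
        problems.append(f"expected exactly one table, found {len(tables)}")
    for table_name, table in tables.items():
        columns = table.get("columns")
        if not isinstance(columns, dict):
            problems.append(f"{table_name}: missing required 'columns' dictionary")
            continue
        # Primary and alternate keys must refer to described columns.
        key_columns = list(table.get("alternate_keys", []))
        if "primary_key" in table:
            key_columns.append(table["primary_key"])
        for column in key_columns:
            if column not in columns:
                problems.append(f"{table_name}: key column {column!r} is not described")
        # Every column description needs an sdtype.
        for column, description in columns.items():
            if "sdtype" not in description:
                problems.append(f"{table_name}.{column}: missing 'sdtype'")
    return problems


problems = find_problems(metadata)
```

An empty list means none of these particular checks failed; it does not guarantee that the metadata matches the CSV data.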
Table Metadata
Each table in the metadata has two keys:
"primary_key": the column name used to identify a row in the table
(required)
"columns": a dictionary description of each column
Boolean columns represent True or False values.
"active": {
    "sdtype": "boolean"
}
Properties (None)
Categorical columns represent discrete data.
"tier": {
    "sdtype": "categorical"
}
Properties (None)
Datetime columns represent a point in time.
"renew_date": {
    "sdtype": "datetime",
    "datetime_format": "%Y-%m-%d"
}
Properties
(required)
datetime_format: A string describing the format, using Python's strftime format codes.
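One way to sanity-check a datetime_format string before putting it in the metadata is to parse a sample value from the column with Python's datetime.strptime, which uses the same format codes as strftime. The helper name matches_datetime_format is hypothetical, and this is only a sketch of a check you might run yourself, not something SDGym is documented to do.

```python
from datetime import datetime


def matches_datetime_format(value, datetime_format):
    """Return True if the string value can be parsed with the given format."""
    try:
        datetime.strptime(value, datetime_format)
    except ValueError:
        return False
    return True


# "%Y-%m-%d" is the format used for the renew_date column above.
ok = matches_datetime_format("2024-03-15", "%Y-%m-%d")
not_ok = matches_datetime_format("15/03/2024", "%Y-%m-%d")
```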
Numerical columns represent discrete or continuous numerical values.
"age": {
    "sdtype": "numerical",
    "computer_representation": "Int64"
},
"paid_amt": {
    "sdtype": "numerical",
    "computer_representation": "Float"
}
Properties
computer_representation: A string that represents how you'll ultimately store the data. This determines the min and max values allowed. Available options are: 'Float', 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'
Use "sdtype": "numerical" to specify columns that represent whole number or continuous values.
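The guide does not list the exact bounds implied by each computer_representation. The sketch below assumes the integer options follow the usual fixed-width conventions (two's complement for 'Int…', non-negative for 'UInt…'); the helper integer_bounds is hypothetical and deliberately leaves 'Float' out.

```python
INTEGER_REPRESENTATIONS = (
    "Int8", "Int16", "Int32", "Int64",
    "UInt8", "UInt16", "UInt32", "UInt64",
)


def integer_bounds(representation):
    """Return (min, max) for an integer representation such as 'Int8' or 'UInt16'.

    Assumes conventional fixed-width integer ranges.
    """
    if representation not in INTEGER_REPRESENTATIONS:
        raise ValueError(f"not an integer representation: {representation!r}")
    signed = not representation.startswith("UInt")
    # Strip the 'UInt' or 'Int' prefix to get the bit width.
    bits = int(representation.replace("UInt", "").replace("Int", ""))
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


int8_bounds = integer_bounds("Int8")
uint16_bounds = integer_bounds("UInt16")
```

Under that assumption, an age column would fit comfortably in 'UInt8', while 'Int64' leaves far more headroom than the data needs.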
ID columns represent identifiers that do not have any special mathematical or semantic meaning.
"user_id": {
    "sdtype": "id",
    "regex_format": "U_[0-9]{3}"
}
Properties
regex_format: A string describing the format of the ID as a regular expression
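To see which values a regex_format admits, you can test sample IDs with Python's re module. The sketch assumes the expression is meant to describe the whole ID (hence fullmatch rather than search), which the guide does not state explicitly; the helper name matches_id_format is hypothetical.

```python
import re

# The pattern from the user_id example above: "U_" followed by exactly 3 digits.
ID_PATTERN = re.compile(r"U_[0-9]{3}")


def matches_id_format(value, pattern=ID_PATTERN):
    """Return True if the entire string matches the ID pattern."""
    return pattern.fullmatch(value) is not None


valid = matches_id_format("U_042")
too_short = matches_id_format("U_42")
too_long = matches_id_format("U_0421")
```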
You can input any other data type such as 'phone_number', 'ssn' or 'email'. See the Sdtypes Reference for a full list.
"address": {
    "sdtype": "address",
    "pii": true
}
Properties
pii: A boolean denoting whether the data is sensitive
(default)
true: The column is sensitive, meaning the values should be anonymized
false: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data