Sequential Metadata

Use this guide to write a description for a single data table that represents sequential data, for example, a timeseries. In sequential data, rows have a specific order. Your data table may contain multiple, independent sequences belonging to different entities. See the diagram below for an illustration of sequential data.
This example shows sequential data related to vital signs. The table contains multiple sequences, each corresponding to a different patient. Within each sequence, health measurements change over time.
Your data description is called metadata. SDMetrics expects metadata as a Python dictionary object.
This is the metadata dictionary for the illustrated sequential table:
{
"entity_columns": ["Patient ID"],
"sequence_index": "Time",
"context_columns": ["Address", "Smoker"],
"fields": {
"Patient ID": {
"type": "id",
"subtype": "string",
"regex": "ID_[0-9]{3}"
},
"Address": {
"type": "categorical",
"pii": True
},
"Smoker": {
"type": "boolean"
},
"Time": {
"type": "datetime",
"format": "%m/%d/%Y"
},
"Heart Rate": {
"type": "numerical",
"subtype": "integer"
},
"Systolic BP": {
"type": "numerical",
"subtype": "integer"
}
}
}
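To make the roles of "entity_columns" and "sequence_index" concrete, here is a minimal sketch in plain Python. The rows are made up for illustration, and the helper name split_into_sequences is hypothetical; it is not part of SDMetrics.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows for illustration; only the column names come from the metadata above.
rows = [
    {"Patient ID": "ID_002", "Time": "01/02/2020", "Systolic BP": 121},
    {"Patient ID": "ID_001", "Time": "01/02/2020", "Systolic BP": 118},
    {"Patient ID": "ID_001", "Time": "01/01/2020", "Systolic BP": 115},
    {"Patient ID": "ID_002", "Time": "01/01/2020", "Systolic BP": 125},
]


def split_into_sequences(rows, entity_columns, sequence_index, datetime_format):
    """Group rows by their entity values, then order each group by the sequence index."""
    sequences = defaultdict(list)
    for row in rows:
        # The combined entity column values identify one sequence.
        key = tuple(row[column] for column in entity_columns)
        sequences[key].append(row)
    for group in sequences.values():
        # Parse the datetime string so rows sort chronologically, not alphabetically.
        group.sort(key=lambda r: datetime.strptime(r[sequence_index], datetime_format))
    return dict(sequences)


sequences = split_into_sequences(rows, ["Patient ID"], "Time", "%m/%d/%Y")
for key, group in sequences.items():
    print(key, [r["Systolic BP"] for r in group])
```

Each key in the result identifies one patient, and the rows under it are ordered by "Time".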

Metadata Specification

The metadata is a dictionary that can have multiple keys:
  • "primary_key": the column name used to identify a row in your table
  • "sequence_index": the column name used to order the rows in the table
  • "entity_columns": a list of column names. Together, the column names are used to identify a single sequence in your data.
  • "context_columns": a list of column names that remain constant throughout a sequence
  • (required) "fields": a dictionary description of each column
{
"sequence_index": "Time",
"entity_columns": ["Patient ID"],
"context_columns": ["Address", "Smoker"],
"fields": { <column information> }
}
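Because "fields" describes each column, it seems reasonable to expect every column named in the top-level keys to also appear in "fields". The sketch below checks this under that assumption. The helper find_undescribed_columns is hypothetical and is not an SDMetrics function.

```python
def find_undescribed_columns(metadata):
    """Return column names used in top-level keys that are missing from "fields"."""
    fields = metadata.get("fields", {})
    referenced = []
    # Keys that hold a single column name.
    for key in ("primary_key", "sequence_index"):
        if key in metadata:
            referenced.append(metadata[key])
    # Keys that hold a list of column names.
    for key in ("entity_columns", "context_columns"):
        referenced.extend(metadata.get(key, []))
    return [name for name in referenced if name not in fields]


metadata = {
    "sequence_index": "Time",
    "entity_columns": ["Patient ID"],
    "context_columns": ["Address", "Smoker"],
    "fields": {
        "Patient ID": {"type": "id", "subtype": "string"},
        "Address": {"type": "categorical"},
        "Time": {"type": "datetime", "format": "%m/%d/%Y"},
    },
}

# "Smoker" is listed as a context column but has no entry in "fields".
print(find_undescribed_columns(metadata))
```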

Column Information (Fields)

Inside "fields", you will describe each column. You'll start with the name of the column. Then you'll specify the type of data and any other information about it.
There are specific data types to choose from: categorical, datetime, numerical, boolean and id. Each type is described below.
Use "type": "categorical" to represent data that has discrete categories
"tier": {
"type": "categorical",
}
Properties
  • "pii": True or False to represent whether the data is sensitive, meaning it should be anonymized in the synthetic data. By default, we assume the data is not sensitive.
Use "type": "datetime" to specify columns that represent points in time
"renew_date": {
"type": "datetime",
"format": "%Y-%m-%d"
}
Properties
  • (required) "format" to describe the format of the datetime string
The format string has special values to describe the components. For example, Jan 06, 2022 is represented as "%b %d, %Y"
See this documentation for a full list. Common values are:
  • Year: "%Y" for a 4-digit year like 2022, or "%y" for a 2-digit year like 22
  • Month: "%m" for a 2-digit month like 01, "%b" for an abbreviated month like Jan
  • Day: "%d" for a 2-digit day like 06
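The codes listed above are also accepted by Python's built-in datetime module, so one way to try out a format string before putting it in your metadata is to parse a sample value with it. This is only a sketch for experimenting; it does not show how SDMetrics itself reads the format.

```python
from datetime import datetime

# Parse the example value from the text using its format string.
parsed = datetime.strptime("Jan 06, 2022", "%b %d, %Y")

# Render the same date with other common format strings.
print(parsed.strftime("%Y-%m-%d"))  # 2022-01-06
print(parsed.strftime("%m/%d/%Y"))  # 01/06/2022
print(parsed.strftime("%y"))        # 22
```

If the format string does not match the sample value, strptime raises a ValueError, which is a quick signal that the "format" entry needs adjusting.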
Use "type": "numerical" to specify columns that represent whole number or continuous values
"age": {
"type": "numerical",
"subtype": "integer",
},
"paid_amt": {
"type": "numerical",
"subtype": "float",
}
Properties
  • (required) "subtype": "float" or "integer" to specify whether this is a continuous value or a whole number
Use "type": "boolean" to represent columns that have True/False values.
"active": {
"type": "boolean"
}
Properties: None
Use "type": "id" to represent any columns that act as row identifiers for the table. In a single table, the ID column is typically the primary key.
"user_id": {
"type": "id",
"subtype": "string",
"regex": "U_[0-9]{3}"
}
Properties
  • (required) "subtype": Either "integer" or "string"
  • "regex": A regular expression describing how to create the id, if the id is a string
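To get a feel for what a "regex" value describes, you can test sample IDs against it with Python's re module. The sketch below assumes the expression is meant to match the whole ID, and the helper matches_id_regex is hypothetical rather than part of SDMetrics.

```python
import re


def matches_id_regex(value, regex):
    """Return True if the entire value matches the regular expression."""
    return re.fullmatch(regex, value) is not None


id_regex = "U_[0-9]{3}"  # from the "user_id" example above

print(matches_id_regex("U_042", id_regex))   # True: prefix plus exactly three digits
print(matches_id_regex("U_42", id_regex))    # False: only two digits
print(matches_id_regex("U_0421", id_regex))  # False: four digits
```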

Saving & Loading Metadata

After creating your dictionary, you can save it as a JSON file. For example, my_metadata_file.json.
import json

with open('my_metadata_file.json', 'w') as f:
    json.dump(my_metadata_dict, f)
In the future, you can load the Python dictionary by reading from the file.
import json

with open('my_metadata_file.json') as f:
    my_metadata_dict = json.load(f)

# use my_metadata_dict in the SDMetrics library
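As a quick sanity check, the sketch below saves a small metadata dictionary and loads it back. It writes to a temporary directory so it does not touch your own files, and the dictionary is a shortened, illustrative version of the one in this guide. Note that the Python value True is written as true in the JSON file and comes back as True when loaded.

```python
import json
import os
import tempfile

metadata = {
    "sequence_index": "Time",
    "entity_columns": ["Patient ID"],
    "fields": {
        "Patient ID": {"type": "id", "subtype": "string", "regex": "ID_[0-9]{3}"},
        "Address": {"type": "categorical", "pii": True},
        "Time": {"type": "datetime", "format": "%m/%d/%Y"},
    },
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_metadata_file.json")

    # Save the dictionary as JSON.
    with open(path, "w") as f:
        json.dump(metadata, f, indent=4)

    # Read the raw text to see how the file represents True.
    with open(path) as f:
        raw_text = f.read()

    # Load it back into a Python dictionary.
    with open(path) as f:
        loaded = json.load(f)

print('"pii": true' in raw_text)
print(loaded == metadata)
```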