Metadata
This guide describes SDGym's metadata specification for a single table of data.
Metadata is a basic, factual description of a dataset that includes:
The type of data that each column represents
The primary keys and other identifiers of the table
How does the SDGym library use metadata? Many of the synthesizers use information in the metadata to create higher-quality synthetic data. For example, the SDV Synthesizers apply different logic to different column types.
Additionally, the evaluation framework factors in the metadata when applying metrics. For example, some metrics may only be applicable for specific column types.
The SDGym library expects every dataset to have corresponding metadata provided as a JSON file. During benchmarking, SDGym reads the file as a Python dictionary.
We assume that the data is present in a CSV format that describes rows and columns of a single table.
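As a rough illustration of the file layout described above, the sketch below writes a tiny CSV and a matching metadata JSON file to a temporary directory, then reads the metadata back as a Python dictionary. The file names, the column names, and the "sdtype" key inside each column description are assumptions made for this example (the key follows the convention used by SDV metadata); this is not SDGym's own loading code.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical metadata: the column names and the "sdtype" key are assumptions.
metadata = {
    "primary_key": "user_id",
    "columns": {
        "user_id": {"sdtype": "id"},
        "is_active": {"sdtype": "boolean"},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "users.csv"          # hypothetical file name
    metadata_path = Path(tmp) / "metadata.json"  # hypothetical file name

    # Write a small single-table CSV and its metadata file.
    data_path.write_text("user_id,is_active\n1,True\n2,False\n")
    metadata_path.write_text(json.dumps(metadata, indent=4))

    # Read the metadata back as a plain Python dictionary.
    with open(metadata_path) as f:
        loaded_metadata = json.load(f)

    # Read only the CSV header row so it can be compared with the metadata.
    with open(data_path, newline="") as f:
        csv_columns = next(csv.reader(f))

# List any CSV columns that the metadata does not describe.
undescribed = [name for name in csv_columns if name not in loaded_metadata["columns"]]
print(undescribed)
```

Comparing the CSV header with the metadata like this is one simple way to catch a mismatch between the two files before benchmarking.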
The metadata has two keys:
"primary_key": the column name used to identify a row in the table
(required) "columns": a dictionary description of each column
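The two keys can be checked with a few lines of Python. The following is a minimal sketch, not part of SDGym: the helper name find_metadata_problems is hypothetical, and the rule that the primary key must also appear under "columns" is an assumption drawn from the description of the primary key as a column name. The "sdtype" entries in the sample are likewise assumed.

```python
def find_metadata_problems(metadata):
    """Return a list of problems found in the top-level metadata keys."""
    problems = []
    columns = metadata.get("columns")
    primary_key = metadata.get("primary_key")

    if not isinstance(columns, dict):
        # "columns" is the only required key and must be a dictionary.
        problems.append('"columns" is required and must be a dictionary')
    elif primary_key is not None and primary_key not in columns:
        # Assumption: the primary key should be one of the described columns.
        problems.append(f'primary key "{primary_key}" is not listed in "columns"')

    return problems


# Hypothetical metadata used only to exercise the helper.
valid_metadata = {
    "primary_key": "user_id",
    "columns": {
        "user_id": {"sdtype": "id"},
        "is_active": {"sdtype": "boolean"},
    },
}

print(find_metadata_problems(valid_metadata))
print(find_metadata_problems({"primary_key": "user_id"}))
```

Returning a list of messages rather than raising on the first problem makes it easier to report every issue in a metadata file at once.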
The "columns" key describes each column. It maps each column name to the type of data in that column and any other information about it. There are specific data types to choose from.
Boolean columns represent True or False values.
Properties (None)
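A boolean column entry might look like the sketch below. The column name "is_active" is made up, and the "sdtype" key used to declare the type is an assumption borrowed from SDV's metadata convention, since this guide does not name the key. Because boolean columns have no extra properties, the entry contains only the type.

```python
import json

# Hypothetical column name; the "sdtype" key is an assumption, not confirmed by this guide.
columns = {
    "is_active": {"sdtype": "boolean"},
}

# Serialize the "columns" section as it would appear in the metadata JSON file.
columns_json = json.dumps({"columns": columns}, indent=4)
print(columns_json)
```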