Use this guide to write a description for a single data table. In a single table, all your data is captured in a 2D format using rows and columns.
Your data description is called metadata. SDMetrics expects metadata as a Python dictionary object.
This is the metadata dictionary for the illustrated table:
{
    "primary_key": "user_id",
    "columns": {
        "user_id": {
            "sdtype": "id",
            "regex_format": "U_[0-9]{3}"
        },
        "age": {
            "sdtype": "numerical"
        },
        "address": {
            "sdtype": "address",
            "pii": True
        },
        "tier": {
            "sdtype": "categorical"
        },
        "active": {
            "sdtype": "boolean"
        },
        "paid_amt": {
            "sdtype": "numerical"
        },
        "renew_date": {
            "sdtype": "datetime",
            "datetime_format": "%Y-%m-%d"
        }
    }
}
The metadata has two keys:
"primary_key": the column name used to identify a row in your table
(required) "columns": a dictionary description of each column
{
    "primary_key": "user_id",
    "columns": { <column information> }
}
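The two-key structure above can be sanity-checked before the dictionary is used. The following is a minimal sketch, not part of SDMetrics: `check_single_table_metadata` is a hypothetical helper, and it assumes that every column needs an "sdtype" and that a primary key, when given, should name one of the described columns.

```python
def check_single_table_metadata(metadata):
    """Return a list of problems found in a single-table metadata dict.

    Hypothetical helper for illustration only; it is not an SDMetrics function.
    """
    problems = []

    # "columns" is the required key and must map column names to descriptions.
    columns = metadata.get("columns")
    if not isinstance(columns, dict):
        problems.append('"columns" is required and must be a dictionary')
        return problems

    # Assumption: every column description is a dict that carries an "sdtype".
    for name, info in columns.items():
        if not isinstance(info, dict) or "sdtype" not in info:
            problems.append(f'column "{name}" has no "sdtype"')

    # Assumption: a primary key, when present, refers to a described column.
    primary_key = metadata.get("primary_key")
    if primary_key is not None and primary_key not in columns:
        problems.append(f'primary key "{primary_key}" is not a described column')

    return problems


metadata = {
    "primary_key": "user_id",
    "columns": {
        "user_id": {"sdtype": "id", "regex_format": "U_[0-9]{3}"},
        "age": {"sdtype": "numerical"},
    },
}

print(check_single_table_metadata(metadata))  # prints []
```

An empty list means none of the assumed checks failed; it does not guarantee that the library will accept the dictionary.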
Inside "columns", you will describe each column. Start with the name of the column, then specify the type of data and any other information about it. There are specific data types to choose from: boolean, categorical, datetime, numerical, id, and other. Each one is described below.
Boolean columns represent True or False values.
"active": {
    "sdtype": "boolean"
}
Properties (None)
Categorical columns describe discrete data.
"tier": {
    "sdtype": "categorical"
}
Properties (None)
Datetime columns represent a point in time.
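"renew_date": {
    "sdtype": "datetime",
    "datetime_format": "%Y-%m-%d"
}
Properties
datetime_format: A string describing the format of the datetime values, written with Python strftime directives such as "%Y-%m-%d"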
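To see what a format string such as "%Y-%m-%d" accepts, it can be tried against sample values with Python's standard `datetime` module. This is a small sketch; `matches_format` is a hypothetical helper name, and the sample values are made up for illustration.

```python
from datetime import datetime


def matches_format(value, datetime_format):
    """Return True if the string can be parsed with the given strftime-style format."""
    try:
        datetime.strptime(value, datetime_format)
    except ValueError:
        return False
    return True


datetime_format = "%Y-%m-%d"

parsed = datetime.strptime("2024-03-15", datetime_format)
print(parsed.year, parsed.month, parsed.day)  # prints 2024 3 15

print(matches_format("2024-03-15", datetime_format))  # prints True
print(matches_format("15/03/2024", datetime_format))  # prints False
```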
Numerical columns represent discrete or continuous numerical values.
"age": {
    "sdtype": "numerical"
},
"paid_amt": {
    "sdtype": "numerical",
    "computer_representation": "Float"
}
Properties
computer_representation: A string that represents how you'll ultimately store the data. This determines the min and max values allowed.
Available options are: 'Float', 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'
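As an illustration of how a representation determines the allowed range, the sketch below computes the bounds of the integer options. It assumes the names follow the usual fixed-width conventions (signed two's-complement for 'Int*', unsigned for 'UInt*'); `integer_bounds` is a hypothetical helper, and 'Float' is left out because the text above does not state its limits.

```python
def integer_bounds(computer_representation):
    """Return (min, max) for an 'Int*' or 'UInt*' representation name.

    Assumes standard fixed-width integer ranges; not an SDMetrics function.
    """
    if computer_representation.startswith("UInt"):
        bits = int(computer_representation[len("UInt"):])
        return 0, 2**bits - 1
    if computer_representation.startswith("Int"):
        bits = int(computer_representation[len("Int"):])
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"no integer bounds for {computer_representation!r}")


print(integer_bounds("Int8"))    # prints (-128, 127)
print(integer_bounds("UInt16"))  # prints (0, 65535)
```

Under these assumptions, an "age" column stored as 'UInt8' could hold values from 0 to 255.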
ID columns represent identifiers that do not have any special mathematical or semantic meaning.
"user_id": {
    "sdtype": "id",
    "regex_format": "U_[0-9]{3}"
}
Properties
regex_format: A string describing the format of the ID as a regular expression
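The example pattern "U_[0-9]{3}" describes IDs made of "U_" followed by exactly three digits. The sketch below tests a few made-up values with Python's `re` module. It assumes the pattern is meant to match the whole ID, which is why it uses `re.fullmatch`; `matches_regex_format` is a hypothetical helper name.

```python
import re


def matches_regex_format(value, regex_format):
    """Return True if the entire value matches the regular expression."""
    return re.fullmatch(regex_format, value) is not None


regex_format = "U_[0-9]{3}"

print(matches_regex_format("U_042", regex_format))   # prints True
print(matches_regex_format("U_42", regex_format))    # prints False
print(matches_regex_format("U_0423", regex_format))  # prints False
```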
Other columns describe data with a more specific meaning, such as an address.
"address": {
    "sdtype": "address",
    "pii": True
}
Properties
pii: A boolean denoting whether the data is sensitive
(default) True: The column is sensitive, meaning the synthetic data is anonymized
False: The column is not sensitive, meaning the synthetic data may not be anonymized
After creating your dictionary, you can save it as a JSON file. For example, my_metadata_file.json.
import json

# write the metadata dictionary to a JSON file
with open('my_metadata_file.json', 'w') as f:
    json.dump(my_metadata_dict, f)
In the future, you can load the Python dictionary by reading from the file.
import json

# read the JSON file back into a Python dictionary
with open('my_metadata_file.json') as f:
    my_metadata_dict = json.load(f)
# use my_metadata_dict in the SDMetrics library
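The save and load steps can be combined into a round-trip check. The sketch below writes a small example dictionary to a temporary folder (so no file is left behind) and confirms that the loaded dictionary equals the original. It also shows that a Python True is stored as JSON true and comes back as True. The example dictionary is a shortened version of the one in this guide.

```python
import json
import os
import tempfile

my_metadata_dict = {
    "primary_key": "user_id",
    "columns": {
        "user_id": {"sdtype": "id", "regex_format": "U_[0-9]{3}"},
        "address": {"sdtype": "address", "pii": True},
    },
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_metadata_file.json")

    # Save the dictionary; indent only makes the file easier to read.
    with open(path, "w") as f:
        json.dump(my_metadata_dict, f, indent=4)

    # Read the raw text so the JSON spelling of booleans is visible.
    with open(path) as f:
        raw_text = f.read()

loaded = json.loads(raw_text)

print('"pii": true' in raw_text)   # prints True
print(loaded == my_metadata_dict)  # prints True
```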