Multi Table Metadata API
This guide will walk you through creating the metadata using the Python API.
Get started by creating a blank
MultiTableMetadata
object.from sdv.metadata import MultiTableMetadata
metadata = MultiTableMetadata()
Automatically detect the metadata based on your actual data. Different methods are available based on the format of your data.
DataFrame
CSV
detect_from_dataframes
: Use this function to automatically detect metadata from multiple tables at a time. metadata.detect_from_dataframes(
data={
'hotels': hotels_dataframe,
'guests': guests_dataframe
}
)
Parameters
- (required)
data
: A dictionary mapping the name of the table to the pandas.DataFrame that contains its data
Output (None)
Deprecation Notice: In previous versions of the SDV, you could detect one table at a time using
detect_table_from_dataframe
. We do not recommend using this older function.detect_from_csvs
: Use this function to automatically detect metadata from muliple CSV filesmetadata.detect_from_csvs(
folder_name='my_data_folder/'
)
Parameters
- (required)
folder_name
: The name of the folder that contains all your CSV files. This method assumes each CSV file is a separate table of your metadata.
Output (None)
Deprecation Notice: In previous versions of the SDV, you could detect one table at a time using
detect_table_from_csv
. We do not recommend using this older function.The detected metadata is not guaranteed to be accurate or complete. Be sure to carefully inspect the metadata and update information.
- Primary keys and other identifiers are auto-detected, but may be incorrect or incomplete. See
set_primary_key
andadd_alternate_keys
method to add them. - Sensitive information may not be auto-detected. Check for columns with an
'unknown'
sdtype and use theupdate_column
method to update them. - Simple relationships between tables are auto-detected, but may be incomplete. Use
add_relationship
to add them.
At any point, during the metadata creation or updates, you can inspect the current state of the metadata.
Use this to get a copy of the Python dictionary that corresponds to the metadata.
Parameters (None)
Output A Python dictionary that corresponds to the metadata
python_dict = metadata.to_dict()
Note that the returned object is a representation of the metadata. Changing it will not modify the original metadata object in any way.
Use this to this to see a visual representation of the metadata. Use the parameters to control the level of details in the visualization and for saving the image.
Parameters
show_table_details
: Toggle the display of column details
(default) 'full' | Show all the different column names, primary keys and foreign keys |
'summarized' | Summarize the columns based on the data type |
None | Hide the details. Only show the table name. |
show_relationship_labels
: Toggle the display of the table relationships
(default) True | Label each relationship between 2 tables with the column names |
False | Do not label the relationships. Only show an arrow between tables. |
output_filepath
: If provided, save the image at the given location in the given format
The
output_filepath
must end with the filetype that you want to save as. Popular examples are png
, jpg
or pdf
.metadata.visualize(
show_table_details='full',
show_relationship_labels=True,
output_filepath='my_metadata.png'
)

Use this to validate that the metadata is written according to the specification. This function will throw descriptive errors if there is anything wrong with the metadata.
Parameters (None)
Output (None)
metadata.validate()
InvalidMetadataError: The metadata is not valid
Error: Invalid values ("pii") for datetime column "start_date".
Error: Invalid regex format string "[A-{6}" for id column "hotel_id"
Use this method to validate that the metadata accurately describes a particular dataset. This function will throw descriptive errors if there is any mismatch between the metadata and data.
Parameters:
- (required)
data
: A dictionary containing your multi-table data. Each key should be the name of a table and the value should be a pandas.DataFrame containing its data. The data should have the same tables and columns as described in the metadata.
Output (None)
metadata.validate_data(data={
'hotels': hotels_dataframe,
'guests': guests_dataframe
})
It is important to verify and update any inaccuracies in the metadata
Use this method to modify the information about a column in your metadata.
Parameters
- (required)
table_name
: The name of the table to update - (required)
column_name
: The name of the column in that table to update - (required)
sdtype
: A string describing the statistical data type. Common types are'boolean'
,'categorical'
,'datetime'
,'numerical'
and'id'
. But other types such as'phone_number'
are also available (see SDTypes). <other properties>
: Based on the sdtype, provide other parameters. See below for some options.
boolean
categorical
datetime
numerical
id
other
Boolean columns represent True or False values.
metadata.update_column(
table_name='guests',
column_name='has_rewards',
sdtype='boolean')
Properties (None)
Categorical columns represent discrete data
metadata.update_column(
table_name='guests',
column_name='room_type',
sdtype='categorical')
Properties (None)
Date columns represent a point in time
metadata.update_column(
table_name='guests',
column_name='checkin_date',
sdtype='datetime',
datetime_format='%d %b %Y')
Properties
Numerical columns represents discrete or continuous numerical values.
metadata.update_column(
table_name='hotels',
column_name='rating',
sdtype='numerical',
computer_representation='Float')
Properties
computer_representation
: A string that represents how you'll ultimately store the data. This determines the min and max values allowed (Default)Float
Available options are:'Float'
,'Int8'
,'Int16'
,'Int32'
,'Int64'
,'UInt8'
,'UInt16'
,'UInt32'
,'UInt64'
ID columns represent identifiers that do not have any special mathematical or semantic meaning
metadata.update_column(
table_name='hotels',
column_name='hotel_id',
sdtype='id',
regex_format='HID_[0-9]{3}')
Properties
You can input any other data type such as
'phone_number'
, 'ssn'
or 'email'
. See the Sdtypes Reference for a full list.metadata.update_column(
table_name='guests',
column_name='guest_email',
sdtype='email',
pii=True
)
Properties
pii
: A boolean denoting whether the data is sensitive- (default)
True
: The column is sensitive, meaning the values should be anonymized False
: The column is not sensitive, meaning the exact set of values can be reused in the synthetic data
Output (None)
Use this function to set the primary key of the table. Any existing primary keys will be removed.
The primary key uniquely identifies every row in the table. When you set a primary key, the SDV will guarantee that every value in the table is unique. At this time, the SDV does not support composite keys.
Parameters
- (required)
table_name
: The name of the table - (required)
column_name
: The column name of the primary key. The column name must already be defined in the metadata and it must be an ID or another PII sdtype.
Output (None)
metadata.set_primary_key(
table_name='hotels',
column_name='hotel_id'
)
metadata.set_primary_key(
table_name='guests',
column_name='guest_email'
)
Use this function to set alternate keys of the table. This method will add to any existing alternate keys you may have.
Similar to primary keys, alternate keys are also unique in your table. However, other tables do not reference alternate keys.
Parameters
- (required)
table_name
: The name of the table - (required)
column_names
: A list of column names that represent the alternate keys in the table. All column names must already be defined in the metadata and they must be IDs or other PII sdtypes.
Output (None)
metadata.add_alternate_keys(
table_name='guests',
column_names=['credit_card_number']
)
Use this method to add a relationship between 2 connected tables: A parent and child table. The parent table contains the primary key references while the child table has rows that refer to its parent. Multiple child rows can refer to the same parent row.
Parameters:
- (required)
parent_table_name
: The name of the parent table - (required)
child_table_name
: The name of the child table that refers to the parent - (required)
parent_primary_key
: The primary key column in the parent table. This column uniquely identifies each row in the parent table . - (required)
child_foreign_key
: The foreign key column in the child table. The values in this column contain a reference to a row in the parent table
Output (None)
metadata.add_relationship(
parent_table_name='hotels',
child_table_name='guests',
parent_primary_key='hotel_id',
child_foreign_key='hotel_id'
)
Use this method to remove the connection between a parent and child table. In the case where there are multiple connections, this method will remove all the connections. Use this if the metadata has incorrectly detected relationships.
Parameters:
- (required)
parent_table_name
: The name of the parent table - (required)
child_table_name
: The name of the child table that refers to the parent
Output (None)
metadata.remove_relationship(
parent_table_name='hotels',
child_table_name='guests'
)
You can save the metadata object as a JSON file and load it again for future use.
Use this to save the metadata object to a new JSON file that will be compatible with SDV 1.0 and beyond. We recommend you write the metadata to a new file every time you update it.
Parameters
- (required)
filepath
: The location of the file that will be created with the JSON metadata
Output (None)
metadata.save_to_json(filepath='metadata.json')
If you already have a metadata JSON file, you can load it in as a
MultiTableMetadata
object. Use the method based on the version of your JSON file.JSON file for SDV 1.0+
Older JSON files
load_from_json
: If you recently wrote your JSON file for SDV, use this class method to load it as a MultiTableMetadata
object.from sdv.metadata import MultiTableMetadata
metadata = MultiTableMetadata.load_from_json(
filepath='metadata.json')
Parameters
- (required)
filepath
: The name of the file containing the JSON metadata
Output A
MultiTableMetadata
objectupgrade_metadata
: If you wrote a JSON file for any SDV version before 1.0, use this class method to upgrade the metadata.You have older metadata if you see a key named
"fields"
instead of "columns"
.from sdv.metadata import MultiTableMetadata
metadata = MultiTableMetadata.upgrade_metadata(
filepath='my_old_metadata.json'
)
Parameters
- (required)
old_filepath
: The filepath to your older metadata JSON file
Output A
MultiTableMetadata
object with the new metadataTip! After upgrading, save your metadata so you can use it again
metadata.save_to_json('my_new_metadata.json')
Use this class method to load a Python dictionary as a
MultiTableMetadata
object.- (required)
metadata_dict
: A Python dictionary representation of the metadata. See Metadata Spec for more details.
Output A MultiTableMetadata object
from sdv.metadata import MultiTableMetadata
metadata_obj = MultiTableMetadata.load_from_dict(metadata_dict)
Use this method to anonymize the column names of your metadata. This makes it easier to share your metadata, eg. for debugging purposes.
Parameters (None)
Output A new MultiTableMetadata object that represents the anonymized metadata
anonymized_metadata = original_metadata.anonymize()
*This feature is only available for licensed, enterprise users. To learn more about the SDV Enterprise features and purchasing a license, get in touch with us.
The anonymized metadata contains new table and column names. The original names are obfuscated, but the sdtypes and other formatting information remains the same.
>>> anonymized_metadata.to_dict()
{
'tables': {
'3oc2d': {
'primary_key': 'id_0',
'columns': {
'id_0': { 'sdtype': 'id', 'regex_format': 'ID_[0-9]{10}' },
'num_0': { 'sdtype': 'numerical' },
'cat_0': { 'sdtype': 'categorical' },
'dt_0': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d' },
'pii_0': { 'sdtype': 'ssn' },
...
},
'dpco1': {
...
}
},
'relationships': [{
'parent_table_name': '3oc2d',
'child_table_name': 'dpco1',
'parent_primary_key': 'id_0',
'child_foreign_key': 'id_2'
}]
}
Last modified 12d ago