# Metadata API

## Creation API

There are various ways to create your metadata. For example, if you are an SDV Enterprise user, you can directly [connect to a database](https://docs.sdv.dev/sdv/multi-table-data/data-preparation/loading-data#connect-to-a-database) to load data and create metadata all at once.

Otherwise, if you are in possession of your data, you can auto-detect the metadata.

### detect\_from\_dataframes

Use this function to automatically detect metadata from data that you've loaded as pandas.DataFrame objects.

**Parameters**:

* (required) `data`: Your data, represented as a dictionary. The keys are your table names and values are the pandas.DataFrame objects containing your data.
* `infer_sdtypes`: A boolean describing whether to infer the sdtypes of each column
  * (default) `True`: Infer the sdtypes of each column based on the data.
  * `False`: Do not infer the sdtypes. All columns will be marked as unknown, ready for you to manually update.
* `infer_keys`: A string describing whether to infer the primary and/or foreign keys.
  * (default) `'primary_and_foreign'`: Infer the primary keys in each table, and the foreign keys in other tables that refer to them
  * `'primary_only'`: Infer the primary keys in each table. You can manually add the foreign key relationships later.
  * `None`: Do not infer any primary or foreign keys. You can manually add these later.
* `foreign_key_inference_algorithm`: The algorithm to use when inferring the foreign key connections to primary keys
  * (default) `'column_name_match'`: Match up foreign and primary key columns that have the same names
  * ＊(default, SDV Enterprise) `'data_match'`: Match up foreign and primary key columns based on the data that they contain

**Output** A Metadata object that describes the data

```python
from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(
    data={
        'hotels': hotels_dataframe,
        'guests': guests_dataframe
    })
```

{% hint style="info" %}
**＊SDV Enterprise Feature.** This feature is only available for licensed, enterprise users. For more information, visit our page to [Compare SDV Features](https://docs.sdv.dev/sdv/explore/sdv-enterprise/compare-features).
{% endhint %}

{% hint style="danger" %}
**The detected metadata is not guaranteed to be accurate or complete.** Be sure to carefully inspect the metadata and update any information that is incorrect or missing.
{% endhint %}
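To make the `'column_name_match'` option more concrete, here is a minimal sketch of the idea in plain Python: a column in another table is treated as a foreign key when it has the same name as a primary key. This is only an illustration and not the SDV's actual implementation. The helper `match_foreign_keys_by_name` and the column names are made up for this example.

```python
def match_foreign_keys_by_name(primary_keys, table_columns):
    """Pair each primary key with same-named columns in the other tables.

    primary_keys: dict of table name -> primary key column name
    table_columns: dict of table name -> list of column names
    Both inputs are hypothetical stand-ins for detected metadata.
    """
    relationships = []
    for parent, key in primary_keys.items():
        for child, columns in table_columns.items():
            # A table never refers to itself in this simple sketch
            if child != parent and key in columns:
                relationships.append({
                    'parent_table_name': parent,
                    'child_table_name': child,
                    'parent_primary_key': key,
                    'child_foreign_key': key,
                })
    return relationships


primary_keys = {'hotels': 'hotel_id', 'guests': 'guest_email'}
table_columns = {
    'hotels': ['hotel_id', 'city', 'rating'],
    'guests': ['guest_email', 'hotel_id', 'checkin_date'],
}

relationships = match_foreign_keys_by_name(primary_keys, table_columns)
print(relationships)
```

A name-based match like this can miss foreign keys that are named differently from the primary key, which is one reason to inspect the detected relationships.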

## Inspection API

At any point during metadata creation or updates, you can inspect the current state of the metadata.

### to\_dict

Use this to get a copy of the Python dictionary that corresponds to the metadata.

**Parameters** (None)

**Output** A Python dictionary that corresponds to the metadata

```python
python_dict = metadata.to_dict()
```

{% hint style="info" %}
Note that the returned object is a representation of the metadata. Changing it will not modify the original metadata object in any way.
{% endhint %}
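Because the result is a regular Python dictionary, you can explore it with standard tools. The sketch below counts how many columns of each sdtype a table has. It uses a hand-written dictionary as a stand-in for the output of `to_dict()`, assuming the `'tables'` and `'columns'` layout shown elsewhere on this page. The table and column names are hypothetical.

```python
from collections import Counter

# Hand-written stand-in for the output of metadata.to_dict()
python_dict = {
    'tables': {
        'guests': {
            'primary_key': 'guest_email',
            'columns': {
                'guest_email': {'sdtype': 'email', 'pii': True},
                'hotel_id': {'sdtype': 'id'},
                'checkin_date': {'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'},
                'room_rate': {'sdtype': 'numerical'},
                'amenities_fee': {'sdtype': 'numerical'},
            }
        }
    }
}


def count_sdtypes(metadata_dict):
    """Return {table name: Counter of sdtypes} for every table."""
    return {
        table_name: Counter(col['sdtype'] for col in table['columns'].values())
        for table_name, table in metadata_dict['tables'].items()
    }


summary = count_sdtypes(python_dict)
print(summary['guests'])
```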

### visualize

Use this to see a visual representation of the metadata. Use the parameters to control the level of detail in the visualization and to save the image.

**Parameters**

* `show_table_details`: Toggle the display of column details

<table data-header-hidden><thead><tr><th width="212"></th><th></th></tr></thead><tbody><tr><td>(default) <code>'full'</code></td><td>Show all the different column names, primary keys and foreign keys</td></tr><tr><td><code>'summarized'</code></td><td>Summarize the columns based on the data type</td></tr><tr><td><code>None</code></td><td><em>Hide the details. Only show the table name.</em></td></tr></tbody></table>

* `show_relationship_labels`: Toggle the display of the table relationships

<table data-header-hidden><thead><tr><th width="170"></th><th></th></tr></thead><tbody><tr><td>(default) <code>True</code></td><td>Label each relationship between 2 tables with the column names</td></tr><tr><td><code>False</code></td><td>Do not label the relationships. Only show an arrow between tables.</td></tr></tbody></table>

* `output_filepath`: If provided, save the image at the given location in the given format

{% hint style="warning" %}
The `output_filepath` must end with the filetype that you want to save as. Popular examples are `png`, `jpg` or `pdf`.
{% endhint %}

**Output** A [graphviz.graphs.Digraph](https://graphviz.readthedocs.io/en/stable/manual.html)

```python
metadata.visualize(
    show_table_details='full',
    show_relationship_labels=True,
    output_filepath='my_metadata.png'
)
```

<figure><img src="https://1967107441-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FfNxEeZzl9uFiJ4Zf4BRZ%2Fuploads%2FP08Tdn7GbwbDH4Jo6dD2%2FMultiTable%20Metadata%20Schema.png?alt=media&#x26;token=5a8a308c-3ee6-45c8-b4b4-e4058ecdd45b" alt=""><figcaption></figcaption></figure>

### get\_column\_names

Use this function to look up column names based on the metadata properties that they have.

{% hint style="info" %}
This is particularly useful if you want to list all columns assigned to a specific sdtype, such as `unknown` in order to update it.
{% endhint %}

**Parameters**

* (required) `sdtype`: A string describing the statistical data type.\
  Common types are `'boolean'`, `'categorical'`, `'datetime'`, `'numerical'` and `'id'`. But other types such as `'phone_number'` are also available (see [SDTypes](https://docs.sdv.dev/sdv/concepts/metadata/sdtypes)).
* `table_name`: The name of the table. *This is required if you have multiple tables.*
* `<other properties>`: Based on the sdtype, provide other parameters. For more information, see the [Metadata Spec](https://docs.sdv.dev/sdv/concepts/metadata).

**Output** A list of strings, with the column names that match the criteria. If no columns match the criteria, then an empty list will be returned.

```python
metadata.get_column_names(sdtype='unknown', table_name='products')
```

```python
[ 'product_id', 'product_code_name', 'code_type']
```
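The matching logic can be pictured as a filter over the column entries of a table. The sketch below applies that filter to a plain dictionary. It is an illustration only and not the SDV's implementation. The helper `find_columns` and the `price` column are hypothetical.

```python
def find_columns(columns, **properties):
    """Return the names of columns whose metadata matches every given property."""
    return [
        name for name, info in columns.items()
        if all(info.get(key) == value for key, value in properties.items())
    ]


# Hypothetical 'columns' entry for a 'products' table
columns = {
    'product_id': {'sdtype': 'unknown'},
    'product_code_name': {'sdtype': 'unknown'},
    'code_type': {'sdtype': 'unknown'},
    'price': {'sdtype': 'numerical'},
}

unknown_columns = find_columns(columns, sdtype='unknown')
datetime_columns = find_columns(columns, sdtype='datetime')
print(unknown_columns)
print(datetime_columns)
```

In this sketch, a lookup with no matches produces an empty list, so the result can always be passed on to a loop or a bulk update without a special case.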

## Validation API

### validate

Use this to validate that the metadata is written according to the specification. This function will throw descriptive errors if there is anything wrong with the metadata.

**Parameters** (None)

**Output** (None)

```python
metadata.validate()
```

```
InvalidMetadataError: The metadata is not valid

Error: Invalid values ("pii") for datetime column "start_date".
Error: Invalid regex format string "[A-{6}" for id column "hotel_id"
```
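The second error above is caused by a regex string that cannot be parsed. If you want to check a regex before adding it to the metadata, one option is to try compiling it with Python's built-in `re` module, as sketched below. This assumes that Python's regex syntax is a reasonable stand-in for what the SDV accepts, which this page does not state. The helper `is_valid_regex` is hypothetical.

```python
import re


def is_valid_regex(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# The character set in the first pattern is never closed with ']'
print(is_valid_regex('[A-{6}'))
print(is_valid_regex('ID_[0-9]{10}'))
```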

### validate\_data

Use this method to validate that the metadata accurately describes a particular dataset. This function will throw descriptive errors if there is any mismatch between the metadata and data.

**Parameters:**

* (required) `data`: A dictionary containing your multi-table data. Each key should be the name of a table and the value should be a pandas.DataFrame containing its data. The data should have the same tables and columns as described in the metadata.

**Output** (None)

```python
metadata.validate_data(data={
    'hotels': hotels_dataframe,
    'guests': guests_dataframe
})
```

### validate\_table

Use this method to validate that the metadata accurately describes a single data table. This function will throw descriptive errors if there is any mismatch between the metadata and data.

**Parameters:**

* (required) `data`: A pandas.DataFrame containing data. The data should have the same columns as described in the metadata.
* `table_name`: The name of the table. *This is required if you have multiple tables.*

**Output** (None)

```python
metadata.validate_table(data=my_dataframe)
```


## Update API

It is important to verify the metadata and correct any inaccuracies.

### update\_column

Use this method to modify the information about a column in your metadata.

**Parameters**

* (required) `column_name`: The name of the column to update
* (required) `sdtype`: A string describing the statistical data type.\
  Common types are `'boolean'`, `'categorical'`, `'datetime'`, `'numerical'` and `'id'`. But other types such as `'phone_number'` are also available. For more information, see [SDTypes docs](https://docs.sdv.dev/sdv/concepts/metadata/sdtypes).
* `table_name`: The name of the table. *This is required if you have multiple tables.*
* `<other properties>`: Based on the sdtype, provide other parameters. For more information, see [SDTypes docs](https://docs.sdv.dev/sdv/concepts/metadata/sdtypes).

**Output** (None)

```python
metadata.update_column(
    column_name='start_date',
    sdtype='datetime',
    table_name='guests',
    datetime_format='%Y-%m-%d')
    
metadata.update_column(
    column_name='user_cell',
    sdtype='phone_number',
    table_name='guests',
    pii=True)
```
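Before setting a `datetime_format`, it can help to confirm that the format string actually matches your values. The sketch below does this with Python's `datetime.strptime` on a few sample strings. It is an optional pre-check and not part of the SDV API. The helper `matches_datetime_format` and the sample values are hypothetical.

```python
from datetime import datetime


def matches_datetime_format(values, datetime_format):
    """Return True if every non-missing value parses with the given format."""
    for value in values:
        if value is None:
            continue  # skip missing values
        try:
            datetime.strptime(value, datetime_format)
        except ValueError:
            return False
    return True


sample_start_dates = ['2024-01-15', '2024-02-29', None]

print(matches_datetime_format(sample_start_dates, '%Y-%m-%d'))
print(matches_datetime_format(sample_start_dates, '%d/%m/%Y'))
```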

### update\_columns

Use this function to make a bulk update to multiple columns at once. This function will allow you to set the same parameters for a group of columns.

**Parameters**

* (required) `column_names`: A list of strings representing the column names to update. All columns must be in the table.
* (required) `sdtype`: A string describing the statistical data type.\
  Common types are `'boolean'`, `'categorical'`, `'datetime'`, `'numerical'` and `'id'`. But other types such as `'phone_number'` are also available (see [SDTypes](https://docs.sdv.dev/sdv/concepts/metadata/sdtypes)).
* `table_name`: The name of the table. *This is required if you have multiple tables.*
* `<other properties>`: Based on the sdtype, provide other parameters

**Output** (None)

```python
metadata.update_columns(
    column_names=['age', 'transactions', 'session_length'],
    sdtype='numerical',
    table_name='users',
    computer_representation='Float'
)
```

### update\_columns\_metadata

Use this function to make a bulk update to multiple columns at once. This function will allow you to set different parameters for each column.

**Parameters**

* (required) `column_metadata`: A dictionary mapping each column name you want to update to the metadata information for that column. All columns must be in the table. For the exact format, see the [Metadata Spec](https://docs.sdv.dev/sdv/concepts/metadata).
* `table_name`: The name of the table. *This is required if you have multiple tables.*

**Output** (None)

```python
metadata.update_columns_metadata(
    column_metadata={
        'age': { 'sdtype': 'numerical' },
        'ssn': { 'sdtype': 'ssn', 'pii': True },
        'gender': { 'sdtype': 'categorical' },
        'dob': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d' },
        ...
    },
    table_name='users',
)
```

### add\_column

Use this function to add a column to your metadata object.

**Parameters**

* (required) `column_name` : Name of the column to be added
* (required) `sdtype`: A string describing the statistical data type. Common types are `'boolean'`, `'categorical'`, `'datetime'`, `'numerical'` and `'id'`. Other types such as `'phone_number'` are also available (see [SDTypes](https://docs.sdv.dev/sdv/concepts/metadata/sdtypes)).
* `table_name`: The name of the table. *This is required if you have multiple tables.*
* `**kwargs`: Any other parameters you need that describe metadata for a column.

**Output** (None)

```python
metadata.add_column(
  column_name='cell_phone_numbers',
  sdtype='phone_number',
  table_name='users',
  pii=True
)
```

### remove\_column

Use this function to remove a column from your metadata object.

**Parameters**

* (required) `column_name`: The name of the column to be deleted
* `table_name`: The name of the table. *This is required if you have multiple tables.*

**Output** (None). This removes the column anywhere it appears in the metadata — in a relationship, as a primary key, etc.

```python
metadata.remove_column(
    column_name='cell_phone_numbers',
    table_name='users'
)
```

### add\_column\_relationship

Use this function to specify when a group of columns within the same table represent the same concept.

{% hint style="info" %}
While anyone can add column relationships to their data, SDV Enterprise users will see the highest quality data for the relationships. To learn more about the SDV Enterprise and its extra features, [visit our website](https://datacebo.com/pricing/).
{% endhint %}

**Parameters**

* (required) `relationship_type`: A string with the type of relationship. This represents a higher level concept. See the tabs below for options.
* (required) `column_names`: A list of column names that are part of that relationship. Make sure that these columns are compatible with the relationship type. See the tabs below for more information.
* `table_name`: The name of the table. *This is required if you have multiple tables.*

{% tabs %}
{% tab title="address" %}
An address is defined by 2 or more columns that have the following sdtypes: `country_code`, `administrative_unit`, `state`, `state_abbr`, `city`, `postcode`, `street_address` and `secondary_address`.

```python
metadata.add_column_relationship(
    relationship_type='address',
    column_names=['addr_line1', 'addr_line2', 'city', 'zipcode', 'state']
)
```

{% endtab %}

{% tab title="gps" %}
A GPS coordinate pair is defined by 2 columns:

* sdtype `latitude` &
* sdtype `longitude`

```python
metadata.add_column_relationship(
    relationship_type='gps',
    column_names=['location_lat', 'location_lon']
)
```

{% endtab %}

{% tab title="More coming soon!" %}
Additional column relationships coming soon!

*Do you have a request for a type of column relationship? Please* [*file a feature request*](https://github.com/sdv-dev/SDV/issues/new/choose) *describing your use case.*
{% endtab %}
{% endtabs %}

**Output** (None)
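The compatibility rules in the tabs above can be summarized as a small check on the sdtypes of the columns you plan to group. The sketch below encodes those rules in plain Python. It is an illustration only, and the exact checks the SDV performs may differ. The helper `is_compatible` is hypothetical.

```python
# sdtypes that can be part of an address, as listed in the 'address' tab
ADDRESS_SDTYPES = {
    'country_code', 'administrative_unit', 'state', 'state_abbr',
    'city', 'postcode', 'street_address', 'secondary_address',
}


def is_compatible(relationship_type, column_sdtypes):
    """Check a list of column sdtypes against the rules for a relationship type."""
    if relationship_type == 'address':
        # 2 or more columns, all with address-related sdtypes
        return len(column_sdtypes) >= 2 and set(column_sdtypes) <= ADDRESS_SDTYPES
    if relationship_type == 'gps':
        # exactly one latitude column and one longitude column
        return sorted(column_sdtypes) == ['latitude', 'longitude']
    return False


print(is_compatible('address', ['street_address', 'secondary_address', 'city', 'postcode', 'state']))
print(is_compatible('gps', ['latitude', 'longitude']))
print(is_compatible('gps', ['latitude', 'latitude']))
```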

### set\_primary\_key

Use this function to set the primary key of the table. Any existing primary keys will be removed.

{% hint style="info" %}
The primary key uniquely identifies every row in the table. When you set a primary key, the SDV will guarantee that every value in the primary key column is unique. At this time, the SDV does not support composite keys.
{% endhint %}

**Parameters**

* (required) `column_name`: The column name of the primary key. The column name must already be defined in the metadata and it must be an ID or another PII sdtype.\
  ＊ *If the table has a composite key, you may provide a list of column names instead; in this case, at least 1 of the columns in the composite key must be an ID or another PII sdtype. Only SDV Enterprise users can create synthesizers with composite keys.*
* `table_name`: The name of the table. *This is required if you have multiple tables.*

**Output** (None)

```python
metadata.set_primary_key(
    column_name='hotel_id',
    table_name='hotels',
)
```
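Since a primary key must uniquely identify every row, it can be useful to confirm that a candidate column has no repeated values in your real data before you set it. The sketch below is one simple way to find repeats using the standard library. It is an optional pre-check and not part of the SDV API. The helper `find_duplicates` and the sample IDs are hypothetical.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that appear more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]


hotel_ids = ['HID_001', 'HID_002', 'HID_003', 'HID_002']

duplicates = find_duplicates(hotel_ids)
print(duplicates)
```

An empty result means the column is a reasonable primary key candidate as far as uniqueness is concerned.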

### remove\_primary\_key

Use this function to remove any existing primary keys in a table.

**Parameters**

* `table_name`: The name of the table. *This is required if you have multiple tables.*

**Output** (None) The primary key will be removed. Any existing relationships that use the primary key will be removed too.

```python
metadata.remove_primary_key(table_name='guests')
```

### add\_alternate\_keys

Use this function to set alternate keys of the table. This method will add to any existing alternate keys you may have.

{% hint style="info" %}
Similar to primary keys, alternate keys are also unique in your table. However, other tables do not reference alternate keys.
{% endhint %}

**Parameters**

* (required) `column_names`: A list of column names that represent the alternate keys in the table. All column names must already be defined in the metadata and they must be IDs or other PII sdtypes.\
  ＊ *If the table has a composite key, you may provide a list of lists; at least 1 of the columns in the composite key must be an ID or another PII sdtype. Only SDV Enterprise users can create synthesizers with composite keys.*
* `table_name`: The name of the table. *This is required if you have multiple tables.*

**Output** (None)

```python
metadata.add_alternate_keys(
    column_names=['credit_card_number'],
    table_name='guests',
)
```

### add\_relationship

Use this method to add a relationship between 2 connected tables: a parent table and a child table. The parent table contains the primary key that is being referenced, while the child table has rows that refer to its parent. Multiple child rows can refer to the same parent row.

**Parameters:**

* (required) `parent_table_name`: The name of the parent table
* (required) `child_table_name`: The name of the child table that refers to the parent
* (required) `parent_primary_key`: The primary key column in the parent table. This column uniquely identifies each row in the parent table.\
  ＊ *If the table has a composite key, you may provide a list of column names instead. Only SDV Enterprise users can create synthesizers with composite keys.*
* (required) `child_foreign_key`: The foreign key column in the child table. The values in this column contain a reference to a row in the parent table\
  ＊ *If the table has a composite key, you may provide a list of column names instead. The length of this list should match the length of the parent primary key. Only SDV Enterprise users can create synthesizers with composite keys.*

**Output** (None)

```python
metadata.add_relationship(
    parent_table_name='hotels',
    child_table_name='guests',
    parent_primary_key='hotel_id',
    child_foreign_key='hotel_id'
)
```
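A parent/child relationship implies that every foreign key value in the child table points to an existing row in the parent table. The sketch below checks this on small, hand-written rows and returns any child rows that have no matching parent. It is an illustration of the concept and not the SDV's validation logic. The helper `find_orphans` and the sample rows are hypothetical.

```python
def find_orphans(parent_rows, child_rows, parent_primary_key, child_foreign_key):
    """Return the child rows whose foreign key has no matching parent row."""
    parent_keys = {row[parent_primary_key] for row in parent_rows}
    return [row for row in child_rows if row[child_foreign_key] not in parent_keys]


hotels = [{'hotel_id': 'HID_001'}, {'hotel_id': 'HID_002'}]
guests = [
    # Two guests refer to the same hotel, which is allowed
    {'guest_email': 'a@example.com', 'hotel_id': 'HID_001'},
    {'guest_email': 'b@example.com', 'hotel_id': 'HID_001'},
    # This guest refers to a hotel that does not exist in the parent table
    {'guest_email': 'c@example.com', 'hotel_id': 'HID_009'},
]

orphans = find_orphans(hotels, guests, 'hotel_id', 'hotel_id')
print(orphans)
```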

### remove\_relationship

Use this method to remove the connection between a parent and child table. In the case where there are multiple connections, this method will remove all the connections. Use this if the metadata has incorrectly detected relationships.

**Parameters:**

* (required) `parent_table_name`: The name of the parent table
* (required) `child_table_name`: The name of the child table that refers to the parent

**Output** (None)

```python
metadata.remove_relationship(
    parent_table_name='hotels',
    child_table_name='guests'
)
```

### add\_table

Use this method to add a new table to your metadata.

**Parameters:**

* (required) `table_name`: The name of the table to add.

**Output** (None) After the table is added, the new table will be empty, so be sure to add columns to it using the [add\_column](#add_column) function.

```python
metadata.add_table(
    table_name='travel_details'
)
```

### remove\_table

Use this method to remove an existing table from your metadata.

**Parameters:**

* (required) `table_name`: The name of the table to remove

**Output** (None) The table will be removed, as well as any relationships that included the table.

```python
metadata.remove_table(
    table_name='travel_details'
)
```

## Adding multi-sequence information

If your data includes multiple sequences, use these methods to add information about them.

### set\_sequence\_key

Use this function to set the sequence key of your table. Any existing sequence keys will be removed.

{% hint style="info" %}
The **sequence key** is a column that identifies which row(s) belong to which sequences. This is usually an ID column but it may also be a PII sdtype (such as `"phone_number"`). At this time, SDV does not support composite keys.

This is important for tables that contain multiple sequences. In our example, the sequence key is `'Patient ID'` because this column is used to break up the sequences.

If you don't supply a sequence key, the SDV assumes that your table only contains a single sequence. *Note: The SDV sequential models do not fully support single sequence data.*
{% endhint %}

**Parameters**

* (required) `column_name`: The column name of the sequence key. The column name must already be defined in the metadata and it must be an ID or another PII sdtype.

**Output** (None)

```python
metadata.set_sequence_key(column_name='Patient ID')
```

### set\_sequence\_index

Use this function to set the sequence index of your table. Any existing sequence indices will be removed.

{% hint style="info" %}
The **sequence index** determines the spacing between the rows in a sequence. Use this if you have an explicit index such as a timestamp. If you don't supply a sequence index, the SDV assumes there is equal spacing of an unknown unit.
{% endhint %}

**Parameters**

* (required) `column_name`: The column name of the sequence index. The column name must already be defined in the metadata. It must be either a numerical or datetime column.

**Output** (None)

```python
metadata.set_sequence_index(column_name='Time')
```
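To picture how the sequence key and sequence index work together, the sketch below splits a few hand-written rows into sequences by `'Patient ID'` and orders each sequence by `'Time'`. It only illustrates the two concepts and is not how the SDV processes sequential data internally. The helper `split_into_sequences` and the `'Heart Rate'` column are hypothetical.

```python
from collections import defaultdict


def split_into_sequences(rows, sequence_key, sequence_index):
    """Group rows by the sequence key, then order each group by the sequence index."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[sequence_key]].append(row)
    for sequence in sequences.values():
        sequence.sort(key=lambda row: row[sequence_index])
    return dict(sequences)


rows = [
    {'Patient ID': 'P1', 'Time': 2, 'Heart Rate': 71},
    {'Patient ID': 'P2', 'Time': 1, 'Heart Rate': 80},
    {'Patient ID': 'P1', 'Time': 1, 'Heart Rate': 68},
]

sequences = split_into_sequences(rows, 'Patient ID', 'Time')
for patient_id, sequence in sequences.items():
    print(patient_id, [row['Time'] for row in sequence])
```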

## Saving, Loading & Sharing Metadata

You can save the metadata object as a JSON file and load it again for future use.

### save\_to\_json

Use this to save the metadata object to a new JSON file that will be compatible with SDV 1.0 and beyond. We recommend you write the metadata to a new file every time you update it.

**Parameters**

* (required) `filepath`: The location of the file that will be created with the JSON metadata
* `mode`: A string describing the mode to use when creating the JSON file
  * (default) `'write'`: Write the metadata to the file, raising an error if the file already exists
  * `'overwrite'`: Write the metadata to the file, replacing the contents if the file already exists

**Output** (None)

```python
metadata.save_to_json(filepath='metadata.json')
```
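The difference between the two modes can be sketched with Python's own file modes: `'x'` refuses to replace an existing file, while `'w'` replaces it. The example below is a stand-in that saves a plain dictionary and is not the SDV's implementation. In particular, the type of error the SDV raises is not specified here, so `FileExistsError` is an assumption of this sketch. The helper `save_dict_to_json` is hypothetical.

```python
import json
import os
import tempfile


def save_dict_to_json(metadata_dict, filepath, mode='write'):
    """'write' refuses to replace an existing file; 'overwrite' replaces it."""
    if mode not in ('write', 'overwrite'):
        raise ValueError("mode must be 'write' or 'overwrite'")
    open_mode = 'x' if mode == 'write' else 'w'
    with open(filepath, open_mode) as f:
        json.dump(metadata_dict, f, indent=4)


with tempfile.TemporaryDirectory() as folder:
    filepath = os.path.join(folder, 'metadata.json')

    # The first write creates the file
    save_dict_to_json({'tables': {}}, filepath)

    # A second write in the default mode fails because the file exists
    try:
        save_dict_to_json({'tables': {}}, filepath)
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True

    # Overwrite mode replaces the contents
    save_dict_to_json({'tables': {'hotels': {}}}, filepath, mode='overwrite')
    with open(filepath) as f:
        saved = json.load(f)

print(second_write_failed, saved)
```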

### load\_from\_json

Use this method to load your file as a Metadata object.

**Parameters**

* (required) `filepath`: The name of the file containing the JSON metadata

**Output:** A Metadata object.
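Since the saved file is plain JSON, you can also peek at it with Python's `json` module, for example to confirm that you are about to load the right file. The sketch below lists the tables and columns in a file. It assumes that the file follows the same `'tables'` and `'columns'` layout as the dictionary returned by `to_dict`, which is an assumption of this example. The helper `list_tables` and the sample contents are hypothetical.

```python
import json
import os
import tempfile


def list_tables(filepath):
    """Return {table name: [column names]} from a JSON metadata file."""
    with open(filepath) as f:
        metadata_dict = json.load(f)
    return {
        name: list(table.get('columns', {}))
        for name, table in metadata_dict.get('tables', {}).items()
    }


# Write a small hypothetical file so that this sketch runs on its own
example = {
    'tables': {
        'hotels': {
            'primary_key': 'hotel_id',
            'columns': {
                'hotel_id': {'sdtype': 'id'},
                'city': {'sdtype': 'city'},
            }
        }
    }
}

with tempfile.TemporaryDirectory() as folder:
    filepath = os.path.join(folder, 'metadata.json')
    with open(filepath, 'w') as f:
        json.dump(example, f)
    tables = list_tables(filepath)

print(tables)
```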

### load\_from\_dict

Use this class method to load a Python dictionary as a `Metadata` object.

**Parameters**

* (required) `metadata_dict`: A Python dictionary representation of the metadata.

**Output** A Metadata object

```python
from sdv.metadata import Metadata

metadata_obj = Metadata.load_from_dict(metadata_dict)
```

### anonymize

Use this method to anonymize the column names of your metadata. This makes it easier to share your metadata, e.g. for debugging purposes.

**Parameters** (None)

**Output** A new Metadata object that represents the anonymized metadata

```python
anonymized_metadata = original_metadata.anonymize()
```

{% hint style="info" %}
**The anonymized metadata contains new column names.** The original names are obfuscated, but the sdtypes and other formatting information remain the same.

```python
>>> anonymized_metadata.to_dict()
{
    'tables': {
        '3oc2d': {
            'primary_key': 'id_0',
            'columns': {
                'id_0': { 'sdtype': 'id', 'regex_format': 'ID_[0-9]{10}' },
                'num_0': { 'sdtype': 'numerical' },
                'num_1': { 'sdtype': 'numerical' },
                'cat_0': { 'sdtype': 'categorical' },
                'dt_0': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d' },
                'pii_0': { 'sdtype': 'ssn' },
                ...
            }
        }
    }
}
```

{% endhint %}

### copy

Use this method to create a copy of your metadata in its current state.

**Parameters** (None)

**Output** A new Metadata object that is a copy of the current metadata. Note that the original metadata and the copy are separate objects, so modifying one will not modify the other.

```python
metadata_copy = metadata.copy()
```
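The independence between the original and the copy behaves like a deep copy of a nested Python object. The sketch below shows the same idea with a plain dictionary and `copy.deepcopy`. It is an analogy only, since how the SDV creates the copy internally is not described here. The dictionary contents are hypothetical.

```python
import copy

# Plain dictionary used as a stand-in for a metadata object
original = {'tables': {'hotels': {'columns': {'hotel_id': {'sdtype': 'id'}}}}}

duplicate = copy.deepcopy(original)

# Changing the copy leaves the original untouched
duplicate['tables']['hotels']['columns']['hotel_id']['sdtype'] = 'categorical'

print(original['tables']['hotels']['columns']['hotel_id']['sdtype'])
print(duplicate['tables']['hotels']['columns']['hotel_id']['sdtype'])
```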
