1.1 Create Metadata and Extract Training Data

In this step, you'll gather the real data that you'll want to use for training an SDV synthesizer. We'll call this training data. Additionally, you'll create some SDV metadata that describes the structure of the data.

To accomplish this, you have two options:

Option A: Import with CSVs (Recommended for POC). Export data from your database as CSV files or, if your data is already available as local files, gather the files in a single folder. Then follow the steps under Option A.

Option B: Directly connect to your database. This option is recommended if your database is supported by DataCebo's AI Connectors. Follow the steps under Option B.

Option A: Import with CSVs (Recommended for POC)

A1: Export your data. Start by manually exporting CSVs for each table of your database. Export a larger number of rows (~10,000) because we may need to drop some rows in later steps. Place all your CSV files in a single folder, for example a my_data/ folder. We recommend naming each CSV file with the name of the table. For example, an export from the Users table would be called users.csv, as shown below:

my_data/
|--- users.csv
|--- transactions.csv
|--- sessions.csv
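
If you would rather script the export than do it by hand, the sketch below shows one way to produce this folder layout. It is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for your database, and the function name export_tables_to_csv and its parameters are hypothetical (they are not part of SDV). Adapt the connection and query to your own database.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_to_csv(connection, table_names, folder="my_data", max_rows=10_000):
    """Write up to max_rows rows of each table to <folder>/<table>.csv."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for table in table_names:
        # Table names cannot be bound as SQL parameters, so accept only plain identifiers.
        if not table.isidentifier():
            raise ValueError(f"Unexpected table name: {table!r}")
        cursor = connection.execute(f'SELECT * FROM "{table}" LIMIT ?', (max_rows,))
        header = [column[0] for column in cursor.description]
        # Name each file after its table, e.g. Users -> users.csv
        path = out_dir / f"{table.lower()}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(cursor)
        written.append(path)
    return written
```

For example, export_tables_to_csv(sqlite3.connect("my_database.db"), ["Users", "Transactions", "Sessions"]) would write users.csv, transactions.csv, and sessions.csv into my_data/ (the database file name here is a placeholder).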

A2: Create metadata. Now you can read the folder of CSVs into Python and use the metadata auto-detection tool to infer SDV metadata from it.

The generated metadata is not guaranteed to be accurate (especially with a basic CSV export)! We recommend visualizing and updating the metadata to make sure that it accurately describes your data. For more information, please see the Metadata docs.

When you're done, validate the metadata, save it into a file, and share it with DataCebo.
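
The A2 steps can be sketched as follows. This assumes a recent SDV version that provides the unified Metadata class, and the folder and file names are the examples used on this page. The wrapper function detect_and_save_metadata is a hypothetical helper written for this illustration; only the SDV calls inside it come from the library.

```python
def detect_and_save_metadata(folder_name="my_data/", output_path="my_metadata.json"):
    """Load a folder of CSVs, auto-detect SDV metadata, validate it, and save it as JSON."""
    # SDV is imported here so that defining this helper does not require SDV to be installed.
    from sdv.datasets.local import load_csvs
    from sdv.metadata import Metadata

    # Returns a dictionary that maps each table name (the file name) to a pandas DataFrame.
    data = load_csvs(folder_name=folder_name)

    # Auto-detection is a starting point only; the result may be inaccurate.
    metadata = Metadata.detect_from_dataframes(data)

    # Review and correct the detected metadata here (for example with
    # metadata.visualize()) before validating and saving it.

    metadata.validate()
    metadata.save_to_json(output_path)  # share this file with DataCebo
    return data, metadata
```

In practice you would likely run these calls step by step in a notebook so that you can inspect and update the metadata between detection and saving.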

Metadata does not contain real data. It includes structural information such as column names, types, and relationships (commonly referred to as schema in database terminology).

A3: Create a valid dataset. Because the original data is a random export from your database, it is not guaranteed to be referentially sound. For example, a row in one table may reference a row in another table that was not included in the export. Use our utility function, drop_unknown_references, to clean this up.

During this step, you'll notice that SDV will remove any rows that contain unknown references. As a result, the overall size of the dataset will be smaller. If this step is successful, the metadata should successfully validate against the data.
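
To see why the dataset shrinks, here is a small pandas illustration of what an unknown reference is. It is not the implementation of drop_unknown_references, just a sketch of the idea on two made-up tables. The column names user_id and session_id are assumptions chosen for this example.

```python
import pandas as pd

# Parent table: the only known users are 1, 2 and 3.
users = pd.DataFrame({"user_id": [1, 2, 3]})

# Child table: one session points at user 7, which was not exported.
sessions = pd.DataFrame({
    "session_id": [10, 11, 12, 13],
    "user_id": [1, 2, 7, 3],
})

# Keep only the child rows whose foreign key matches a known parent row.
known = sessions["user_id"].isin(users["user_id"])
cleaned_sessions = sessions[known].reset_index(drop=True)

print(len(sessions), "->", len(cleaned_sessions))  # 4 -> 3
```

The session that refers to the missing user is dropped, so every remaining reference resolves to a row in the parent table.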

A4: Estimate parameters from the real data. Now, you should be able to estimate basic parameters based on the data and metadata. These parameters will ensure that the "fake" data that we test with is realistic. Save the parameters in a file and share it with DataCebo.

Option B: Directly connect to your database

This is the recommended option if your database is supported by SDV. If your database is supported by DataCebo's AI Connectors, please request access to this feature.

If your database is not supported by DataCebo's AI Connectors, let us know about your database flavor so that we can add it to our future roadmap. In the meantime, you can proceed using CSVs (Option A).

B1: Install AI Connectors. DataCebo will provide you with a username/license key combo that you can use to install AI Connectors. Specific instructions are available on each AI Connector's page.

B2: Create metadata. With AI Connectors, you can connect your database directly to SDV. The connector will be able to create metadata in the format that SDV requires. Specific instructions are available on each AI Connector's page.

We recommend visualizing and updating the metadata to make sure that it accurately describes your data. For more information, please see the Metadata docs.

When you're done, validate the metadata, save it into a file, and share it with DataCebo.

metadata.validate()
metadata.save_to_json('my_metadata.json') # share this file with DataCebo

Metadata does not contain real data. It includes structural information such as column names, types, and relationships (commonly referred to as schema in database terminology).

B3: Import training data from the database. Use the connector and metadata to import a training set. You provide the name of your main table (the one containing the most important entity) and the number of rows to import from it. We recommend ~2500 rows. Specific instructions are available on each AI Connector's page.

SDV will produce a dictionary of pandas.DataFrames corresponding to each table. SDV guarantees that the training dataset is referentially sound. The metadata should successfully validate against the data that you just imported.

B4: Estimate parameters from the real data. Now, you should be able to estimate basic parameters based on the data and metadata. These parameters will ensure that the "fake" data that we test with is realistic. Save the parameters in a file and share it with DataCebo.

Using either of the options above, you'll create a metadata JSON file and a parameters JSON file. Share both files with the DataCebo team.

Metadata and parameters do not contain real data.

Metadata includes structural information such as column names, types, and relationships (commonly referred to as schema in database terminology).

The parameters contain basic information that can be used to create realistic data. For example, they include the min/max ranges and category values that are possible within each individual column. They do not contain any other statistical information describing the shape of your data, correlations, etc.
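
To make this concrete, the sketch below computes the same kind of per-column information with pandas. It is only an illustration of the idea: the function summarize_columns and the dictionary layout it returns are hypothetical, and they are not SDV's parameter estimation tool or its file format.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_columns(table):
    """Return min/max for numeric columns and the distinct values for all others."""
    summary = {}
    for name in table.columns:
        column = table[name].dropna()
        if is_numeric_dtype(column):
            summary[name] = {"min": column.min(), "max": column.max()}
        else:
            summary[name] = {"categories": sorted(column.unique().tolist())}
    return summary


# Made-up example table; the column names are assumptions for this sketch.
transactions = pd.DataFrame({
    "amount": [20.5, 5.0, 12.25],
    "currency": ["USD", "EUR", "USD"],
})
params = summarize_columns(transactions)
```

Each column is summarized on its own, so nothing in the result describes how often a value occurs or how columns relate to each other.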

Are you ready for the next step?

By the end of this step you should have the following:

A referentially sound dataset that can be used for training an SDV synthesizer
SDV metadata that accurately describes the data (verified by you!)
A JSON file that contains your metadata, and is shared with DataCebo
A JSON file that contains parameters, and is shared with DataCebo
