1.1 Create Metadata and Extract Training Data
In this step, you'll gather the real data that you want to use for training an SDV synthesizer. We'll call this the training data. You'll also create SDV metadata that describes the structure of the data.
To accomplish this, you have two options:
Option A: Export data from your database as CSV files, or gather the files in a single folder if your data is already available as local files. Then follow the steps under Option A.
Option B: Connect directly to your database. This option is recommended if your database is supported by DataCebo's AI Connectors. Follow the steps under Option B.
Option A: Import with CSVs (Recommended for POC)
A1: Export your data. Start by manually exporting a CSV for each table of your database. Export a larger number of rows (~10,000) because some rows may need to be dropped in later steps. Place all your CSV files in a single folder, for example a my_data/ folder. We recommend naming each CSV file after its table. For example, an export from the Users table would be called users.csv, as shown below:
my_data/
|--- users.csv
|--- transactions.csv
|--- sessions.csv
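Before moving on, it can help to confirm that the folder contains what you expect. The following is a minimal sketch using only the Python standard library; the helper name summarize_csv_folder is hypothetical and not part of SDV. It treats each file name (without the .csv extension) as the table name and assumes the first row of each file is a header.

```python
import csv
from pathlib import Path


def summarize_csv_folder(folder_name):
    """Map each CSV file in a folder to its table name, columns, and row count."""
    summary = {}
    for csv_path in sorted(Path(folder_name).glob('*.csv')):
        with csv_path.open(newline='', encoding='utf-8') as f:
            reader = csv.reader(f)
            header = next(reader, [])          # first row holds the column names
            num_rows = sum(1 for _ in reader)  # remaining rows are data rows
        # users.csv -> table name "users"
        summary[csv_path.stem] = {'columns': header, 'num_rows': num_rows}
    return summary


# Example usage (assumes the my_data/ folder described above exists):
# for table_name, info in summarize_csv_folder('my_data/').items():
#     print(table_name, info['num_rows'], info['columns'])
```

A table with far fewer rows than expected, or a missing file, is easier to fix at this point than after metadata detection.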
A2: Create metadata. Now you can read the folder of CSVs into Python and use the metadata auto-detection tool to infer SDV metadata from it.
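A minimal sketch of this step, assuming a recent SDV release in which load_csvs is importable from sdv.datasets.local and the Metadata class provides detect_from_dataframes. The wrapper name detect_metadata_from_folder is hypothetical; check the SDV docs for your installed version if an import fails.

```python
def detect_metadata_from_folder(folder_name):
    """Load every CSV in a folder and auto-detect SDV metadata for it."""
    # Imported inside the function so this sketch can be pasted into a
    # module even before SDV is installed in the environment.
    from sdv.datasets.local import load_csvs
    from sdv.metadata import Metadata

    # data is a dict mapping each table name to a pandas.DataFrame
    data = load_csvs(folder_name=folder_name)
    # Auto-detection is a best guess; review the result before relying on it.
    metadata = Metadata.detect_from_dataframes(data)
    return data, metadata
```

For the folder above, this would be called as data, metadata = detect_metadata_from_folder('my_data/').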
The generated metadata is not guaranteed to be accurate (especially with a basic CSV export)! We recommend visualizing and updating the metadata to make sure that it accurately describes your data. For more information, please see the Metadata docs.
When you're done, validate the metadata, save it into a file, and share it with DataCebo.
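The validate-and-save step can be sketched as follows, assuming metadata is the SDV metadata object from the previous step. The wrapper validate_and_save is a hypothetical convenience; the two method calls are the same ones shown under Option B.

```python
def validate_and_save(metadata, filepath='my_metadata.json'):
    """Validate SDV metadata, then write it to a JSON file for sharing."""
    metadata.validate()              # raises an error if the metadata is inconsistent
    metadata.save_to_json(filepath)  # share this file with DataCebo
    return filepath
```

Validating first means a file is only written once the metadata is internally consistent.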
Metadata does not contain real data. It includes structural information such as column names, types, and relationships (commonly referred to as schema in database terminology).
A3: Create a valid dataset. Because the original data is a random export from your database, it is not guaranteed to be referentially sound. For example, a row in one table may reference a row that was not included in another table's export (an unknown reference). Use our utility function, drop_unknown_references, to clean this up.
During this step, SDV removes any rows that contain unknown references, so the overall size of the dataset will be smaller. If this step succeeds, the metadata should validate against the data.
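To illustrate what an unknown reference is, here is a small pure-Python sketch with made-up rows. It is not SDV's implementation, and the function name drop_rows_with_unknown_parent is hypothetical; it only shows the idea for a single parent/child pair.

```python
def drop_rows_with_unknown_parent(parent_rows, child_rows, parent_key, foreign_key):
    """Keep only child rows whose foreign key matches an existing parent key."""
    known_keys = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] in known_keys]


# Illustrative data only
users = [{'user_id': 1}, {'user_id': 2}]
sessions = [
    {'session_id': 10, 'user_id': 1},
    {'session_id': 11, 'user_id': 3},  # user 3 was not part of the export
    {'session_id': 12, 'user_id': 2},
]

cleaned_sessions = drop_rows_with_unknown_parent(users, sessions, 'user_id', 'user_id')
print(len(sessions), '->', len(cleaned_sessions))  # prints "3 -> 2"
```

In practice, the drop_unknown_references utility applies this kind of cleanup across all tables, using the relationships recorded in the metadata.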
Option B: Directly connect to your database
This is the recommended option if your database is supported by DataCebo's AI Connectors. To use it, please request access to this feature.
B1: Install AI Connectors. DataCebo will provide you with a username and license key that you can use to install AI Connectors. Specific instructions are available on each AI Connector's page.
B2: Create metadata. With AI Connectors, you can connect your database directly to SDV. The connector can create metadata in the format that SDV requires. Specific instructions are available on each AI Connector's page.
We recommend visualizing and updating the metadata to make sure that it accurately describes your data. For more information, please see the Metadata docs.
When you're done, validate the metadata, save it into a file, and share it with DataCebo.
metadata.validate()
metadata.save_to_json('my_metadata.json') # share this file with DataCebo
Metadata does not contain real data. It includes structural information such as column names, types, and relationships (commonly referred to as schema in database terminology).
B3: Import training data from the database. Use the connector and metadata to import a training set. You provide the name of your main table (the one containing the most important entity) and the number of rows to import from it. We recommend ~2500 rows. Specific instructions are available on each AI Connector's page.
SDV will produce a dictionary of pandas.DataFrames corresponding to each table. SDV guarantees that the training dataset is referentially sound. The metadata should successfully validate against the data that you just imported.
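A quick way to confirm this is to validate the imported data against the metadata and look at the table sizes. The sketch below assumes metadata is the SDV metadata object and data is the dictionary of tables returned by the connector; the helper name check_training_data is hypothetical.

```python
def check_training_data(metadata, data):
    """Validate data against metadata and report the number of rows per table."""
    # validate_data raises an error if the data does not match the metadata
    metadata.validate_data(data=data)
    return {table_name: len(table) for table_name, table in data.items()}
```

The returned dictionary maps each table name to its row count, which makes it easy to spot a table that came back empty or much smaller than expected.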
Share metadata JSON with the DataCebo team
Once you have validated the metadata JSON file created with either of the options above, share it with the DataCebo team.
Metadata does not contain real data. It includes structural information such as column names, types, and relationships (commonly referred to as schema in database terminology).
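If you want to review what the file holds before sharing it, a short standard-library sketch like the one below can list its contents. The JSON layout it reads (a "tables" mapping with per-column "sdtype" entries, plus a "relationships" list) is an assumption about the saved format, and the helper name summarize_metadata_file is hypothetical; open the file directly if your version differs.

```python
import json


def summarize_metadata_file(path):
    """Return the table, column, and relationship names stored in a metadata JSON file."""
    with open(path, encoding='utf-8') as f:
        metadata = json.load(f)

    # Assumed layout: {"tables": {name: {"columns": {col: {"sdtype": ...}}}},
    #                  "relationships": [{...}, ...]}
    tables = {
        table_name: {
            column_name: column.get('sdtype')
            for column_name, column in table.get('columns', {}).items()
        }
        for table_name, table in metadata.get('tables', {}).items()
    }
    relationships = [
        (
            rel.get('parent_table_name'), rel.get('parent_primary_key'),
            rel.get('child_table_name'), rel.get('child_foreign_key'),
        )
        for rel in metadata.get('relationships', [])
    ]
    return {'tables': tables, 'relationships': relationships}
```

The summary contains only names and types, which matches the point above: the file describes structure, not rows of data.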
Are you ready for the next step?
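By the end of this step you should have a referentially sound training dataset, along with validated metadata that is saved as a JSON file and shared with DataCebo.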