Loading Data

Demo Data

The SDV library contains many different demo datasets that you can use to get started. Use the demo module to access these datasets.

The demo module accesses the SDV's public dataset repository. These methods require an internet connection.

get_available_demos

Use this method to get information about all the available demos in the SDV's public dataset repository.

Parameters

  • modality: Set this to the string 'multi_table' to see all the multi table demo datasets

Returns A pandas DataFrame object containing the name of the dataset, its size (in MB) and the number of tables it contains.

from sdv.datasets.demo import get_available_demos

get_available_demos(modality='multi_table')
dataset_name            size_MB        num_tables
Accidents_v1            172.3          3
airbnb-simplified       371.5          2
Atherosclerosis_v1      2.9            4                     
...                     ...            ...
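
Because the result is a regular pandas DataFrame, you can filter and sort it before picking a dataset. The sketch below uses a stand-in DataFrame built from the three sample rows shown above instead of calling the repository, and it assumes that the size_MB and num_tables columns can be read as numbers. The 200 MB cutoff is an arbitrary illustration.

```python
import pandas as pd

# Stand-in for the DataFrame returned by get_available_demos(modality='multi_table').
# The three rows are copied from the sample output above.
demos = pd.DataFrame({
    'dataset_name': ['Accidents_v1', 'airbnb-simplified', 'Atherosclerosis_v1'],
    'size_MB': [172.3, 371.5, 2.9],
    'num_tables': [3, 2, 4],
})

# Coerce to numbers in case the columns arrive as strings (an assumption).
demos['size_MB'] = pd.to_numeric(demos['size_MB'])
demos['num_tables'] = pd.to_numeric(demos['num_tables'])

# Keep only the datasets under 200 MB, smallest first.
small_demos = demos[demos['size_MB'] < 200].sort_values('size_MB')
small_names = small_demos['dataset_name'].tolist()
print(small_names)
```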

download_demo

Use this method to download a demo dataset from the SDV's public dataset repository.

Parameters

  • (required) modality: Set this to the string 'multi_table' to access multi table demo data

  • (required) dataset_name: A string with the name of the demo dataset. You can use any of the dataset names returned by the get_available_demos method.

  • output_folder_name: A string with the name of a folder. If provided, this method will download the data and metadata into the folder, in addition to returning the data.

    • (default) None: Do not save the data into a folder. The data will still be returned so that you can use it in your Python script.

Output A tuple (data, metadata).

The data is a dictionary that maps each table name to a pandas DataFrame containing the demo data for that table. The metadata is a MultiTableMetadata object that describes the data.

from sdv.datasets.demo import download_demo

data, metadata = download_demo(
    modality='multi_table',
    dataset_name='fake_hotels'
)

guests_table = data['guests']
hotels_table = data['hotels']
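
Since the data is a plain dictionary of DataFrames, a quick way to get oriented is to summarize the shape of every table. The sketch below is not part of the SDV. The summarize_tables helper is hypothetical, and the stand-in tables and their columns are invented for illustration, so the code runs without an internet connection.

```python
import pandas as pd

def summarize_tables(data):
    """Return one row per table with its row and column counts."""
    rows = [
        {'table_name': name, 'num_rows': len(table), 'num_columns': table.shape[1]}
        for name, table in data.items()
    ]
    return pd.DataFrame(rows, columns=['table_name', 'num_rows', 'num_columns'])

# Stand-in for the dictionary returned by download_demo.
# The column names and values are hypothetical.
demo_like_data = {
    'hotels': pd.DataFrame({'hotel_id': ['H1', 'H2'], 'city': ['Boston', 'Austin']}),
    'guests': pd.DataFrame({
        'guest_id': ['G1', 'G2', 'G3'],
        'hotel_id': ['H1', 'H2', 'H1'],
        'room_rate': [120.0, 95.5, 130.0],
    }),
}

summary = summarize_tables(demo_like_data)
print(summary)
```

The same helper should work on the real data dictionary, since it relies only on each value being a pandas DataFrame.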

Loading your own local datasets

A local dataset is a dataset that is already stored on your computer. Accessing it does not require an internet connection.

load_csvs

Use this method to load any datasets that are stored as CSVs.

Parameters

  • (required) folder_name: A string with the name of the folder where the datasets are stored

  • read_csv_parameters: A dictionary with additional parameters to use when reading the CSVs. The keys are any of the parameter names of the pandas.read_csv function and the values are your inputs.

Returns A dictionary that contains all the CSV data found in the folder. The key is the name of the file (without the .csv suffix) and the value is a pandas DataFrame containing the data.

from sdv.datasets.local import load_csvs

# assume that my_folder contains many CSV files
datasets = load_csvs(
    folder_name='my_folder/',
    read_csv_parameters={
        'skipinitialspace': True,
        'encoding': 'utf_32'
    })

# the data is available under the file name
guests_table = datasets['guests']
hotels_table = datasets['hotels']
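
To make the shape of the returned dictionary concrete, here is a rough hand-written approximation of the behavior described above using only pandas and the standard library. It is a sketch, not the SDV's implementation. The load_csv_folder helper, the file names, and the file contents are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_csv_folder(folder_name, read_csv_parameters=None):
    """Read every .csv file in a folder into a dict keyed by file name (no suffix)."""
    read_csv_parameters = read_csv_parameters or {}
    tables = {}
    for csv_path in sorted(Path(folder_name).glob('*.csv')):
        # Each entry in read_csv_parameters is forwarded to pandas.read_csv.
        tables[csv_path.stem] = pd.read_csv(csv_path, **read_csv_parameters)
    return tables

# Build a throwaway folder with two small CSV files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, 'hotels.csv').write_text('hotel_id, city\nH1, Boston\nH2, Austin\n')
    Path(folder, 'guests.csv').write_text('guest_id, hotel_id\nG1, H1\nG2, H2\nG3, H1\n')
    datasets = load_csv_folder(folder, read_csv_parameters={'skipinitialspace': True})

table_names = sorted(datasets)
hotel_columns = list(datasets['hotels'].columns)
print(table_names, hotel_columns)
```

Here skipinitialspace strips the space that follows each comma, so the column is named city and not " city".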

Where's the metadata? If you're loading your own datasets, please create and load in your metadata separately. See the Multi Table Metadata API guide for more details.

Cleaning your data

Use the utility functions below to clean your multi-table data for fast and effective multi-table modeling.

drop_unknown_references

Multi-table SDV synthesizers work best when your dataset has referential integrity, meaning that all the references in a foreign key refer to an existing value in the primary key. Use this function to drop rows that contain unknown references for your proof-of-concept synthesizer.

Parameters

  • (required) metadata: A MultiTableMetadata object

  • (required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.

  • drop_missing_values: A boolean that describes whether to drop missing values in the foreign key

    • (default) True: If a foreign key has a missing value, treat it as an unknown reference and drop it. We recommend this setting for maximum efficiency with SDV.

    • False: If a foreign key has a missing value, treat it as a valid reference and keep it

  • verbose: A boolean that controls whether to print out a summary of the results

    • (default) True: Print a summary of the number of rows that are dropped from each table

Output A dictionary that maps each table name to a pandas DataFrame containing data. The returned data will have referential integrity, meaning that there will be no unknown foreign key references.

from sdv.utils import poc

cleaned_data = poc.drop_unknown_references(data, metadata)
Success! All foreign keys have referential integrity. 

Table Name    # Rows (Original)    # Invalid Rows   # Rows (New)
sessions      1200                 50               1150     
transactions  5000                 0                5000
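
To see what an unknown reference looks like, the check can be approximated by hand in pandas for a single parent and child relationship. This is a minimal sketch of the idea, not the SDV's implementation. The tables, the column names, and the find_unknown_references helper are hypothetical. Its drop_missing_values flag is meant to mirror the two settings described above.

```python
import pandas as pd

# Hypothetical parent table (primary key: user_id) and child table (foreign key: user_id).
users = pd.DataFrame({'user_id': [1, 2, 3]})
sessions = pd.DataFrame({
    'session_id': [10, 11, 12, 13],
    'user_id': [1, 2, 99, None],  # 99 has no parent row; None is a missing value
})

def find_unknown_references(child, foreign_key, parent, primary_key,
                            drop_missing_values=True):
    """Return a boolean mask marking child rows to drop."""
    is_missing = child[foreign_key].isna()
    is_known = child[foreign_key].isin(parent[primary_key])
    # A non-missing value that matches no primary key is always unknown.
    unknown = ~is_known & ~is_missing
    if drop_missing_values:
        # Treat missing foreign keys as unknown references too.
        unknown = unknown | is_missing
    return unknown

strict_mask = find_unknown_references(sessions, 'user_id', users, 'user_id')
lenient_mask = find_unknown_references(
    sessions, 'user_id', users, 'user_id', drop_missing_values=False
)

cleaned_sessions = sessions[~strict_mask]
print(cleaned_sessions)
```

With the strict setting, both the row that points to user 99 and the row with a missing user are dropped. With the lenient setting, only the row that points to user 99 is dropped.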

Do you have data in other formats?

The SDV uses the pandas library for data manipulation and synthesizing. If your data is in any other format, load it in as a pandas.DataFrame object to use in the SDV. For multi table data, make sure you format your data as a dictionary, mapping each table name to a different DataFrame object.

multi_table_data = {
    'table_name_1': <pandas.DataFrame>,
    'table_name_2': <pandas.DataFrame>,
    ...
}

Pandas offers many methods to load in different types of data, such as an Excel file, a SQL table or a JSON string. For example, to read Excel files:

import pandas as pd

data_table_1 = pd.read_excel('file://localhost/path/to/table_1.xlsx')
data_table_2 = pd.read_excel('file://localhost/path/to/table_2.xlsx')
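
The other two formats mentioned above can be loaded in a similar way. The sketch below reads one table from a JSON string and one from a SQL table, then assembles the multi table dictionary. The table names, columns, and values are hypothetical, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import io
import sqlite3

import pandas as pd

# JSON string -> DataFrame. Wrapping the string in StringIO works across pandas versions.
hotels_json = '[{"hotel_id": "H1", "city": "Boston"}, {"hotel_id": "H2", "city": "Austin"}]'
hotels_table = pd.read_json(io.StringIO(hotels_json))

# SQL table -> DataFrame, using an in-memory SQLite database as a stand-in.
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE guests (guest_id TEXT, hotel_id TEXT)')
connection.executemany(
    'INSERT INTO guests VALUES (?, ?)',
    [('G1', 'H1'), ('G2', 'H2'), ('G3', 'H1')],
)
guests_table = pd.read_sql_query('SELECT * FROM guests', connection)
connection.close()

# Format the tables as a dictionary that maps each table name to a DataFrame.
multi_table_data = {
    'hotels': hotels_table,
    'guests': guests_table,
}
print({name: table.shape for name, table in multi_table_data.items()})
```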

For more options, see the pandas reference.


Copyright (c) 2023, DataCebo, Inc.