Loading Data

Demo Data

The SDV library contains many different demo datasets that you can use to get started. Use the demo module to access these datasets.

get_available_demos

Use this method to get information about all the available demos in the SDV's public dataset repository.

Parameters

  • modality: Set this to the string 'single_table' to see all the single table demo datasets

Returns A pandas DataFrame object containing the name of the dataset, its size (in MB) and the number of tables it contains.

from sdv.datasets.demo import get_available_demos

get_available_demos(modality='single_table')
dataset_name        size_MB        num_tables
adult               3.6            1
alarm               4.6            1
census              141.2          1
...                 ...            ...

download_demo

Use this method to download a demo dataset from the SDV's public dataset repository.

Parameters

  • (required) modality: Set this to the string 'single_table' to access single table demo data

  • (required) dataset_name: A string with the name of the demo dataset. You can use any of the dataset names returned by the get_available_demos method.

  • output_folder_name: A string with the name of a folder. If provided, this method will download the data and metadata into the folder, in addition to returning the data.

    • (default) None: Do not save the data into a folder. The data will still be returned so that you can use it in your Python script.

Output A tuple (data, metadata).

The data is a pandas DataFrame containing the demo data, and the metadata is a Metadata object that describes the data.

get_source

Some datasets have a source file that describes where the dataset comes from. This can include information like a URL, citations for the original publication, and other information that tracks the dataset's provenance. Use this function to get all this source information.

Parameters

  • (required) modality: Set this to the string 'single_table' to access single table demo data

  • (required) dataset_name: A string with the name of the demo dataset. You can use any of the dataset names returned by the get_available_demos method.

  • output_filepath: A string with a file path. If provided, this method will create the file and write the source information to the file, in addition to returning it.

    • (default) None: Do not save the source information into a file. The contents will still be returned so that you can print it out and read it.

Output A string containing the contents of the source information. You can print it out to read it. (If no source information is available for a dataset, the function returns None and no file will be written.)

get_readme

Some datasets have a README file that gives more information about what the dataset means. This could include explanations for naming conventions used in the dataset, mappings for ID codes, or business logic. Use this function to get the README (if it exists).

README information is coming soon! At this time, SDV demo datasets do not contain any README information. If you're looking for more information about the dataset, we recommend getting the source. From there, you'll be able to navigate to any URLs or contact the original authors as needed.

Parameters

  • (required) modality: Set this to the string 'single_table' to access single table demo data

  • (required) dataset_name: A string with the name of the demo dataset. You can use any of the dataset names returned by the get_available_demos method.

  • output_filepath: A string with a file path. If provided, this method will create the file and write the README information to the file, in addition to returning it.

    • (default) None: Do not save the README information into a file. The contents will still be returned so that you can print it out and read it.

Output A string containing the contents of the README information. You can print it out to read it. (If no README information is available for a dataset, the function returns None and no file will be written.)

Loading your own (local) datasets

A local dataset is a dataset that you have already downloaded onto your computer. Local datasets do not require an internet connection to access.

load_csvs

Use this method to load any datasets that are stored as CSVs.

Parameters

  • (required) folder_name: A string with the name of the folder where the datasets are stored

  • read_csv_parameters: A dictionary with additional parameters to use when reading the CSVs. The keys are any of the parameter names of the pandas.read_csv function and the values are your inputs.

Returns A dictionary that contains all the CSV data found in the folder. The key is the name of the file (without the .csv suffix) and the value is a pandas DataFrame containing the data.

Where's the metadata? If you're loading your own datasets, please create and load in your metadata separately. See the Metadata guide for more details.

Do you have data in other formats?

The SDV uses the pandas library for data manipulation and synthesizing. If your data is in any other format, load it in as a pandas.DataFrame object to use in the SDV.

Pandas offers many methods to load in different types of data. For example, it can read an Excel file, a SQL table or a JSON string.
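As a rough sketch of what this can look like, the snippet below reads a JSON string and a SQL table into DataFrames. The table name, column names and values are invented for the example, and an in-memory SQLite database stands in for a real database. Reading Excel works similarly through pandas.read_excel, which needs an Excel engine such as openpyxl installed.

```python
import sqlite3
from io import StringIO

import pandas as pd

# JSON string -> DataFrame. Wrapping the text in StringIO keeps
# recent pandas versions from warning about literal strings.
json_text = '[{"guest_id": 1, "room_rate": 120.5}, {"guest_id": 2, "room_rate": 98.0}]'
json_data = pd.read_json(StringIO(json_text))

# SQL table -> DataFrame, using an in-memory SQLite database
# as a stand-in for your own database connection.
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE guests (guest_id INTEGER, room_rate REAL)')
connection.executemany(
    'INSERT INTO guests VALUES (?, ?)', [(1, 120.5), (2, 98.0)])
connection.commit()
sql_data = pd.read_sql('SELECT * FROM guests', connection)
connection.close()

# Excel file -> DataFrame (hypothetical file name, not executed here):
# excel_data = pd.read_excel('my_data.xlsx')

print(json_data)
print(sql_data)
```

Either DataFrame can then be used in the SDV like any other loaded table.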

For more options, see the pandas reference.
