Loading Data
The SDV library contains many different demo datasets that you can use to get started. Use the demo module to access these datasets.
The demo module accesses the SDV's public dataset repository. These methods require an internet connection.
Use the get_available_demos method to get information about all the available demos in the SDV's public dataset repository.
Parameters
- modality: Set this to the string 'sequential' to see all the sequential demo datasets
Returns A pandas DataFrame object containing the name of the dataset, its size (in MB) and the number of tables it contains.
The SDV currently only supports sequential data that is present in a single table. If you use the 'sequential' modality, the number of tables is always 1.
from sdv.datasets.demo import get_available_demos
get_available_demos(modality='sequential')
dataset_name size_MB num_tables
ArticularyWordRecognition 8.8 1
AtrialFibrillation 0.627 1
BasicMotions 0.741 1
... ... ...
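Because the result is a regular pandas DataFrame, you can filter and sort it before choosing a dataset to download. The sketch below does not call the SDV. It rebuilds the three sample rows shown above as a stand-in and assumes the size_MB column is numeric. If it turns out to hold strings in your version, convert it with pd.to_numeric first. The max_size_mb threshold is a hypothetical value chosen for illustration.

```python
import pandas as pd

# Stand-in for the DataFrame returned by get_available_demos(modality='sequential').
# The three rows are copied from the sample output above; the real result has more rows.
demos = pd.DataFrame({
    'dataset_name': ['ArticularyWordRecognition', 'AtrialFibrillation', 'BasicMotions'],
    'size_MB': [8.8, 0.627, 0.741],
    'num_tables': [1, 1, 1],
})

# Keep only the datasets smaller than the threshold, smallest first.
max_size_mb = 1.0  # hypothetical threshold
small_demos = demos[demos['size_MB'] < max_size_mb].sort_values('size_MB')
small_names = small_demos['dataset_name'].tolist()
print(small_names)
```

Any name in small_names can then be passed as a dataset name when downloading a demo.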
Use the download_demo method to download a demo dataset from the SDV's public dataset repository.
Parameters
- (required) modality: Set this to the string 'sequential' to access sequential demo data
- (required) dataset_name: A string with the name of the demo dataset. You can use any of the dataset names from the get_available_demos method.
- output_folder_name: A string with the name of a folder. If provided, this method will download the data and metadata into the folder, in addition to returning the data.
  - (default) None: Do not save the data into a folder. The data will still be returned so that you can use it in your Python script.
Output A tuple (data, metadata). The data is a pandas DataFrame containing the demo data and the metadata is a SingleTableMetadata object that describes the data.
from sdv.datasets.demo import download_demo
data, metadata = download_demo(
    modality='sequential',
    dataset_name='ArticularyWordRecognition',
    output_folder_name='sdv_demo_datasets/word_data/'
)
A local dataset is a dataset that you have already downloaded onto your computer. These do not require any internet connectivity to access.
Use the load_csvs method to load any datasets that are stored as CSVs.
Parameters
- (required) folder_name: A string with the name of the folder where the datasets are stored
- read_csv_parameters: A dictionary with any optional parameters needed for loading the CSV data. See the pandas.read_csv documentation for a full list of options.
Returns A dictionary that contains all the CSV data found in the folder. The key is the name of the file (without the .csv suffix) and the value is a pandas DataFrame containing the data.
from sdv.datasets.local import load_csvs
# assume that my_folder contains 1 CSV file named 'patient_data.csv'
datasets = load_csvs(folder_name='my_folder/')
patient_object = datasets['patient_data']
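To make the return shape and the role of read_csv_parameters concrete, here is a rough pandas-only sketch. It is not the SDV's implementation: load_csvs_sketch is a hypothetical helper written for this illustration, and the semicolon-separated file and its column names are made up. It only mimics the behavior described above, where each CSV in the folder becomes a DataFrame keyed by its file name without the .csv suffix.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs_sketch(folder_name, read_csv_parameters=None):
    """Hypothetical stand-in that mimics the documented return shape of load_csvs."""
    read_csv_parameters = read_csv_parameters or {}
    datasets = {}
    for csv_path in sorted(Path(folder_name).glob('*.csv')):
        # Key: file name without the .csv suffix. Value: the loaded DataFrame.
        datasets[csv_path.stem] = pd.read_csv(csv_path, **read_csv_parameters)
    return datasets


# Build a throwaway folder with one small CSV so the sketch runs on its own.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, 'patient_data.csv').write_text('patient_id;heart_rate\n1;72\n2;80\n')
    # The file uses ';' as its separator, so pass that through to pandas.read_csv.
    datasets = load_csvs_sketch(folder, read_csv_parameters={'sep': ';'})

patient_data = datasets['patient_data']
print(list(datasets))
print(list(patient_data.columns))
```

With the real load_csvs, the same dictionary of pandas.read_csv options would be passed through the read_csv_parameters argument.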
Where's the metadata? If you're loading your own datasets, please create and load in your metadata separately. See the Sequential Metadata API guide for more details.
The SDV uses the pandas library for data manipulation and synthesizing. If your data is in any other format, load it in as a pandas.DataFrame object to use in the SDV.
Pandas offers many methods to load in different types of data. For example, you can load an Excel file, a SQL table or a JSON string.
import pandas as pd
data = pd.read_excel('file://localhost/path/to/table.xlsx')
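A JSON string can be loaded in a similar way. The following is a minimal sketch using pandas.read_json. The JSON content, including the column names and values, is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON string; the column names and values are made up for illustration.
json_string = '[{"patient_id": 1, "heart_rate": 72}, {"patient_id": 2, "heart_rate": 80}]'

# Wrap the string in StringIO so that pandas treats it as file-like input.
data = pd.read_json(StringIO(json_string))
print(data.shape)
```

The resulting data object is a pandas.DataFrame, which is the format the SDV expects.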