Synthetic Data Vault

Copyright (c) 2023, DataCebo, Inc.

Loading Data

Demo Data

The SDV library contains many different demo datasets that you can use to get started. Use the demo module to access these datasets.

The demo module accesses the SDV's public dataset repository. These methods require an internet connection.

get_available_demos

Use this method to get information about all the available demos in the SDV's public dataset repository.

Parameters

  • modality: Set this to the string 'sequential' to see all the sequential demo datasets

Returns: A pandas DataFrame object containing the name of each dataset, its size (in MB), and the number of tables it contains.

The SDV currently only supports sequential data that is present in a single table. If you use the 'sequential' modality, the number of tables is always 1.

from sdv.datasets.demo import get_available_demos

get_available_demos(modality='sequential')

dataset_name                    size_MB        num_tables
ArticularyWordRecognition       8.8            1
AtrialFibrillation              0.627          1
BasicMotions                    0.741          1
...                             ...            ...
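
The returned listing can be filtered like any other table. The sketch below shows one possible way to pick the smallest demo before downloading anything. The helper names smallest_demo and pick_smallest_sequential_demo are hypothetical (they are not part of the SDV), and the column names are assumed to match the output shown above.

```python
def smallest_demo(rows):
    """Return the dataset_name of the row with the smallest size_MB.

    `rows` is a list of dicts that use the column names from the listing
    above (dataset_name, size_MB, num_tables).
    """
    return min(rows, key=lambda row: float(row["size_MB"]))["dataset_name"]


def pick_smallest_sequential_demo():
    """Query the public repository (needs internet) and pick the smallest demo."""
    # Imported here so the pure helper above works without the SDV installed.
    from sdv.datasets.demo import get_available_demos

    demos = get_available_demos(modality="sequential")
    # DataFrame.to_dict("records") yields one dict per row.
    return smallest_demo(demos.to_dict("records"))
```

Calling pick_smallest_sequential_demo() returns a name that can be passed on as dataset_name when downloading.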

download_demo

Use this method to download a demo dataset from the SDV's public dataset repository.

Parameters

  • (required) modality: Set this to the string 'sequential' to access sequential demo data

  • (required) dataset_name: A string with the name of the demo dataset. You can use any of the dataset names returned by the get_available_demos method.

  • output_folder_name: A string with the name of a folder. If provided, this method will download the data and metadata into the folder, in addition to returning the data.

    • (default) None: Do not save the data into a folder. The data will still be returned so that you can use it in your Python script.

Output: A tuple (data, metadata).

from sdv.datasets.demo import download_demo

data, metadata = download_demo(
    modality='sequential',
    dataset_name='ArticularyWordRecognition',
    output_folder_name='sdv_demo_datasets/word_data/'
)
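
When you pass output_folder_name, it can be convenient to choose a folder that does not exist yet, so that a fresh download never mixes with older files. The following is a minimal sketch under that assumption. The helpers unused_folder and download_to_new_folder are hypothetical names for illustration, not SDV functions, and how the SDV itself treats an existing folder should be checked against your installed version.

```python
import os


def unused_folder(base):
    """Return `base` if no such path exists, else the first free base_1, base_2, ..."""
    base = base.rstrip("/\\")
    candidate, suffix = base, 0
    while os.path.exists(candidate):
        suffix += 1
        candidate = f"{base}_{suffix}"
    return candidate


def download_to_new_folder(dataset_name, base_folder):
    """Download a sequential demo (needs internet) into a folder that is not in use."""
    # Imported here so unused_folder works without the SDV installed.
    from sdv.datasets.demo import download_demo

    folder = unused_folder(base_folder)
    data, metadata = download_demo(
        modality="sequential",
        dataset_name=dataset_name,
        output_folder_name=folder,
    )
    return data, metadata, folder
```

For example, download_to_new_folder('ArticularyWordRecognition', 'sdv_demo_datasets/word_data/') returns the data, the metadata, and the folder that was actually used.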

Loading your own local datasets

A local dataset is a dataset that you have already downloaded onto your computer. These do not require any internet connectivity to access.

load_csvs

Use this method to load any datasets that are stored as CSVs.

Parameters

  • (required) folder_name: A string with the name of the folder where the datasets are stored

Returns: A dictionary that contains all the CSV data found in the folder. The key is the name of the file (without the .csv suffix) and the value is a pandas DataFrame containing the data.

from sdv.datasets.local import load_csvs

# assume that my_folder contains 1 CSV file named 'patient_data.csv'
datasets = load_csvs(folder_name='my_folder/')

patient_object = datasets['patient_data']
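
Because each dictionary key is the file name without the .csv suffix, the keys can be previewed before loading anything. The sketch below does this with the standard library and then forwards read_csv_parameters to load_csvs. The helpers csv_table_names and load_folder are hypothetical names, and the parameter values are only illustrative.

```python
from pathlib import Path


def csv_table_names(folder_name):
    """List the keys load_csvs is expected to produce: file names minus '.csv'."""
    return sorted(
        path.stem
        for path in Path(folder_name).iterdir()
        if path.is_file() and path.suffix == ".csv"
    )


def load_folder(folder_name):
    """Load every CSV in the folder, forwarding options to pandas.read_csv."""
    # Imported here so csv_table_names works without the SDV installed.
    from sdv.datasets.local import load_csvs

    print("Expected keys:", csv_table_names(folder_name))
    return load_csvs(
        folder_name=folder_name,
        # Illustrative pandas.read_csv options; adjust them to your files.
        read_csv_parameters={"skipinitialspace": True, "encoding": "utf_8"},
    )
```

With the folder from the example above, csv_table_names('my_folder/') would be expected to list 'patient_data'.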

Do you have data in other formats?

import pandas as pd

data = pd.read_excel('file://localhost/path/to/table.xlsx')
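
If you work with several file types, one option is to choose the pandas reader from the file extension. The sketch below is an illustration only: the mapping and the helper names reader_name_for and load_table are assumptions, not part of the SDV, and pandas.read_excel may need an additional engine package installed.

```python
from pathlib import Path

# File extension -> name of the pandas function that reads it.
PANDAS_READERS = {
    ".csv": "read_csv",
    ".xlsx": "read_excel",
    ".json": "read_json",
}


def reader_name_for(path):
    """Return the name of the pandas reader for this file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in PANDAS_READERS:
        raise ValueError(f"No pandas reader configured for '{suffix}' files")
    return PANDAS_READERS[suffix]


def load_table(path, **reader_options):
    """Load one file into a pandas DataFrame, ready to use in the SDV."""
    # Imported here so the lookup above can be used without pandas installed.
    import pandas as pd

    reader = getattr(pd, reader_name_for(path))
    return reader(path, **reader_options)
```

For instance, load_table('path/to/table.xlsx') dispatches to pandas.read_excel and returns a DataFrame.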


For download_demo, the data is a pandas DataFrame containing the demo data, and the metadata is a Metadata object that describes the data.

load_csvs also accepts read_csv_parameters: A dictionary with any optional parameters needed for loading the CSV data. See the pandas.read_csv documentation for a full list of options.

Where's the metadata? If you're loading your own datasets, please create and load in your metadata separately. See the Sequential Metadata API guide for more details.

The SDV uses the pandas library for data manipulation and synthesizing. If your data is in any other format, load it in as a pandas.DataFrame object to use in the SDV.

Pandas offers many methods to load in different types of data. For example: an Excel file, a SQL table, or a JSON string.

For more options, see the pandas reference.