Synthetic Data Vault

Copyright (c) 2023, DataCebo, Inc.


Cleaning Your Data


Last updated 7 months ago

Use the utility functions below to clean your sequential data for fast and effective modeling.

get_random_sequence_subset

Use this function to subsample data from your dataset. Given multi-sequence data, this function will randomly select sequences and clip them to the desired length.

from sdv.utils import get_random_sequence_subset

# Randomly keep 100 of the sequences found in the data
subsampled_data = get_random_sequence_subset(
    data,
    metadata,
    num_sequences=100
)

Parameters

  • (required) data: A pandas.DataFrame containing your multi-sequence data

  • (required) metadata: A Metadata object that describes the data. The metadata must describe multi-sequence data, meaning that it must have a sequence key specified.

  • (required) num_sequences: An int describing the number of sequences to subsample from the data

  • max_sequence_length: The maximum length each sequence is allowed to be

    • (default) None: Do not enforce any max length, meaning that entire sequences will appear in the subsampled data

    • <integer>: An integer describing the max sequence length. Any sequence that is longer than this value will be shortened based on the method below

  • long_sequence_subsampling_method: The method for shortening sequences that are too long

    • (default) 'first_rows': Keep the first n rows of each sequence as they appear, where n is the max sequence length

    • 'last_rows': Keep the last n rows of each sequence as they appear, where n is the max sequence length

    • 'random': Randomly choose n rows of each sequence, where n is the max sequence length. Note the randomly chosen rows will still appear in the same order as the original data.
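To make the three shortening methods concrete, here is a minimal pandas sketch of how a single sequence could be clipped under each option. It is an illustration only, not SDV's implementation: the helper name clip_sequence and its seed argument are hypothetical, and the sketch assumes the DataFrame index reflects the original row order.

```python
import pandas as pd


def clip_sequence(rows, max_length, method="first_rows", seed=None):
    """Illustrative only (hypothetical helper): shorten one sequence to max_length rows."""
    if max_length is None or len(rows) <= max_length:
        return rows  # no limit, or the sequence is already short enough
    if method == "first_rows":
        return rows.head(max_length)
    if method == "last_rows":
        return rows.tail(max_length)
    if method == "random":
        # Pick rows at random, then restore the original order via the index.
        return rows.sample(n=max_length, random_state=seed).sort_index()
    raise ValueError(f"Unknown method: {method!r}")


# One sequence with five rows, in order.
sequence = pd.DataFrame({"step": [0, 1, 2, 3, 4], "value": [10, 11, 12, 13, 14]})

first = clip_sequence(sequence, 3, "first_rows")    # keeps steps 0, 1, 2
last = clip_sequence(sequence, 3, "last_rows")      # keeps steps 2, 3, 4
rand = clip_sequence(sequence, 3, "random", seed=0)  # 3 random rows, still in order
```

The 'random' branch sorts by index after sampling, which is what keeps the chosen rows in the same order as the original data.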

Output: A dataset with fewer rows than before. The dataset will continue to represent multiple sequences of potentially varying lengths.
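One way to sanity-check a subsampled dataset is to count its sequences and inspect their lengths. The sketch below uses plain pandas on a small stand-in DataFrame. The column name patient_id is a hypothetical sequence key, so substitute the sequence key from your own metadata.

```python
import pandas as pd

# Stand-in for subsampled output; 'patient_id' is a hypothetical sequence key.
subsampled_data = pd.DataFrame({
    "patient_id": ["A", "A", "A", "B", "B", "C"],
    "heart_rate": [70, 72, 71, 65, 66, 80],
})

# Number of rows in each sequence.
sequence_lengths = subsampled_data.groupby("patient_id").size()

# How many sequences remain; compare against num_sequences.
num_sequences = len(sequence_lengths)

# Longest sequence; compare against max_sequence_length if one was set.
longest_sequence = int(sequence_lengths.max())
```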
