Cleaning Your Data

Use the utility functions below to clean your sequential data for fast and effective modeling.

get_random_sequence_subset

Use this function to subsample data from your dataset. Given multi-sequence data, this function randomly selects whole sequences and, if you set a maximum sequence length, clips any longer sequence to that length.

from sdv.utils import get_random_sequence_subset

subsampled_data = get_random_sequence_subset(
    data,
    metadata,
    num_sequences=100
)
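To illustrate what selecting sequences means, here is a minimal pandas-only sketch on a toy table. It is not SDV's implementation. The helper `pick_sequences` and the columns `patient_id` and `heart_rate` are hypothetical names chosen for this example. The point is that sampling happens per sequence key, so every selected sequence keeps all of its rows.

```python
import pandas as pd

# Toy multi-sequence data: 'patient_id' plays the role of the sequence key (hypothetical).
toy = pd.DataFrame({
    'patient_id': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'heart_rate': [70, 72, 71, 80, 82, 65, 66, 64, 67],
})

def pick_sequences(data, sequence_key, num_sequences, seed=0):
    """Hypothetical helper: keep all rows of `num_sequences` randomly chosen sequences."""
    keys = pd.Series(data[sequence_key].unique())
    chosen = keys.sample(n=num_sequences, random_state=seed)
    # Filter rows by key so that each chosen sequence stays complete and in order.
    return data[data[sequence_key].isin(chosen)]

subset = pick_sequences(toy, 'patient_id', num_sequences=2)
print(subset)
```

The result contains exactly two of the three sequences, each with its original number of rows.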

Parameters

(required) data: A pandas.DataFrame containing your multi-sequence data
(required) metadata: A Metadata object that describes the data. The metadata must describe multi-sequence data, meaning that it must have a sequence key specified.
(required) num_sequences: An int describing the number of sequences to subsample from the data
max_sequence_length: The maximum length each sequence is allowed to be
- (default) None: Do not enforce any max length, meaning that entire sequences will appear in the subsampled data
- <integer>: An integer describing the max sequence length. Any sequence that is longer than this value will be shortened based on the method below
long_sequence_subsampling_method: The method for shortening sequences that are too long
- (default) 'first_rows': Keep the first n rows of each sequence as they appear, where n is the max sequence length
- 'last_rows': Keep the last n rows of each sequence as they appear, where n is the max sequence length
- 'random': Randomly choose n rows of each sequence, where n is the max sequence length. Note the randomly chosen rows will still appear in the same order as the original data.
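The three shortening methods can be mimicked with plain pandas, as in the sketch below. It illustrates the documented behavior under assumptions and is not SDV's internal code. The helper `clip_sequences` and the columns `sequence_id` and `value` are hypothetical.

```python
import pandas as pd

def clip_sequences(data, sequence_key, max_sequence_length, method='first_rows', seed=0):
    """Hypothetical helper: shorten each sequence to at most max_sequence_length rows."""
    grouped = data.groupby(sequence_key, sort=False)
    if method == 'first_rows':
        clipped = grouped.head(max_sequence_length)
    elif method == 'last_rows':
        clipped = grouped.tail(max_sequence_length)
    elif method == 'random':
        pieces = []
        for _, seq in grouped:
            # Short sequences are kept whole; long ones are sampled down to the limit.
            n = min(len(seq), max_sequence_length)
            pieces.append(seq.sample(n=n, random_state=seed))
        clipped = pd.concat(pieces)
    else:
        raise ValueError(f'Unknown method: {method!r}')
    # Sorting by the original index keeps rows in the same order as the input data.
    return clipped.sort_index()

toy = pd.DataFrame({
    'sequence_id': ['A'] * 5 + ['B'] * 2,
    'value': [1, 2, 3, 4, 5, 10, 20],
})

first = clip_sequences(toy, 'sequence_id', 3, method='first_rows')
last = clip_sequences(toy, 'sequence_id', 3, method='last_rows')
rand = clip_sequences(toy, 'sequence_id', 3, method='random')

print(first['value'].tolist())  # [1, 2, 3, 10, 20]
print(last['value'].tolist())   # [3, 4, 5, 10, 20]
print(rand['value'].tolist())   # 3 values from A in original order, then 10, 20
```

Sequence B has only two rows, so it is left untouched by every method. With the SDV function itself, the same request would be expressed by passing max_sequence_length=3 together with the desired long_sequence_subsampling_method.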

Output: A dataset with fewer rows than the original. The dataset continues to represent multiple sequences of potentially varying lengths.

