Cleaning Your Data
Use the utility functions below to clean your sequential data for fast and effective modeling.
get_random_sequence_subset
Use this function to subsample data from your dataset. Given multi-sequence data, this function will randomly select sequences and clip them to the desired length.
Parameters
(required) data
: A pandas.DataFrame containing your multi-sequence data
(required) metadata
: A Metadata object that describes the data. The metadata must describe multi-sequence data, meaning that it must have a sequence key specified.
(required) num_sequences
: An int describing the number of sequences to subsample from the data
max_sequence_length
: The maximum length each sequence is allowed to be
  (default) None
  : Do not enforce any max length, meaning that entire sequences will appear in the subsampled data
  <integer>
  : An integer describing the max sequence length. Any sequence that is longer than this value will be shortened based on the method below
long_sequence_subsampling_method
: The method for shortening sequences that are too long
  (default) 'first_rows'
  : Keep the first n rows of each sequence as they appear, where n is the max sequence length
  'last_rows'
  : Keep the last n rows of each sequence as they appear, where n is the max sequence length
  'random'
  : Randomly choose n rows of each sequence, where n is the max sequence length. Note the randomly chosen rows will still appear in the same order as the original data.
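The three shortening methods can be illustrated with a small sketch. This is not the library's implementation: shorten_sequence is a hypothetical helper, a plain Python list stands in for the rows of a single sequence, and the seed argument is added here only to make the random case repeatable.

```python
import random


def shorten_sequence(rows, max_sequence_length=None, method="first_rows", seed=None):
    """Hypothetical helper: clip one sequence (a list of rows) to a max length."""
    # No max length, or the sequence is already short enough: keep every row.
    if max_sequence_length is None or len(rows) <= max_sequence_length:
        return list(rows)

    n = max_sequence_length
    if method == "first_rows":
        return list(rows[:n])
    if method == "last_rows":
        return list(rows[len(rows) - n:])
    if method == "random":
        rng = random.Random(seed)
        # Pick n row positions, then sort them so the kept rows
        # stay in the same order as in the original sequence.
        kept_positions = sorted(rng.sample(range(len(rows)), n))
        return [rows[i] for i in kept_positions]
    raise ValueError(f"Unknown method: {method!r}")


sequence = ["r0", "r1", "r2", "r3", "r4"]
print(shorten_sequence(sequence, 3, "first_rows"))  # ['r0', 'r1', 'r2']
print(shorten_sequence(sequence, 3, "last_rows"))   # ['r2', 'r3', 'r4']
print(shorten_sequence(sequence, 3, "random", seed=0))  # 3 rows, original order kept
```

Sorting the sampled positions is what keeps the 'random' result in the original row order, which matters for sequential data where order carries meaning.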
Output
A dataset with fewer rows than before. The dataset will continue to represent multiple sequences of potentially varying lengths.
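To make the shape of the output concrete, here is a rough sketch of the overall behavior under simplifying assumptions. It is not the library's code: subsample_sequences is a hypothetical function, a list of dicts stands in for the DataFrame, the column names patient_id and heart_rate are invented for illustration, and only the 'first_rows' behavior is shown.

```python
import random
from collections import defaultdict


def subsample_sequences(rows, sequence_key, num_sequences,
                        max_sequence_length=None, seed=None):
    """Hypothetical sketch: keep a random set of sequences, clipped to a max length."""
    # Group rows by their sequence key, preserving row order within each sequence.
    groups = defaultdict(list)
    for row in rows:
        groups[row[sequence_key]].append(row)

    # Randomly choose which sequences to keep.
    rng = random.Random(seed)
    chosen = set(rng.sample(sorted(groups), num_sequences))

    subset = []
    for key, sequence_rows in groups.items():
        if key not in chosen:
            continue
        if max_sequence_length is not None:
            # 'first_rows' behavior: keep only the first n rows.
            sequence_rows = sequence_rows[:max_sequence_length]
        subset.extend(sequence_rows)
    return subset


# Invented example data: three sequences with 4, 2, and 3 rows.
rows = (
    [{"patient_id": "A", "heart_rate": hr} for hr in (70, 72, 71, 69)]
    + [{"patient_id": "B", "heart_rate": hr} for hr in (80, 82)]
    + [{"patient_id": "C", "heart_rate": hr} for hr in (65, 66, 64)]
)

subset = subsample_sequences(rows, "patient_id", num_sequences=2,
                             max_sequence_length=3, seed=0)
print(len(rows), "rows before,", len(subset), "rows after")
print(sorted({row["patient_id"] for row in subset}))
```

The result still contains several sequences, and their lengths can differ because sequences shorter than the max length are kept whole.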