Cleaning Your Data
Use the utility functions below to clean your sequential data for fast and effective modeling.
Use this function to subsample data from your dataset. Given multi-sequence data, this function will randomly select sequences and clip them to the desired length.
Parameters
(required) data
: A pandas.DataFrame containing your multi-sequence data
(required) metadata
: A metadata object that describes the data. The metadata must describe multi-sequence data, meaning that it must have a sequence key specified.
(required) num_sequences
: An int describing the number of sequences to subsample from the data
max_sequence_length
: The maximum length each sequence is allowed to be
(default) None
: Do not enforce any max length, meaning that entire sequences will appear in the subsampled data
<integer>
: An integer describing the max sequence length. Any sequence that is longer than this value will be shortened based on the method below
long_sequence_subsampling_method
: The method for shortening sequences that are too long
(default) 'first_rows'
: Keep the first n rows of each sequence as they appear, where n is the max sequence length
'last_rows'
: Keep the last n rows of each sequence as they appear, where n is the max sequence length
'random'
: Randomly choose n rows of each sequence, where n is the max sequence length. Note the randomly chosen rows will still appear in the same order as the original data.
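The three shortening methods can be illustrated on a single sequence with plain pandas. This is only a sketch of the behavior described above, not the library's implementation; the helper name `clip_sequence` and the `seed` argument are hypothetical.

```python
import numpy as np
import pandas as pd


def clip_sequence(sequence, max_sequence_length=None,
                  method="first_rows", seed=None):
    """Shorten one sequence (a DataFrame) to at most max_sequence_length rows.

    Hypothetical helper written for illustration only.
    """
    # None means no limit; short sequences are returned unchanged.
    if max_sequence_length is None or len(sequence) <= max_sequence_length:
        return sequence
    if method == "first_rows":
        return sequence.iloc[:max_sequence_length]
    if method == "last_rows":
        return sequence.iloc[-max_sequence_length:]
    if method == "random":
        rng = np.random.default_rng(seed)
        # Pick row positions without replacement, then sort them so the
        # kept rows stay in their original order.
        positions = np.sort(
            rng.choice(len(sequence), size=max_sequence_length, replace=False)
        )
        return sequence.iloc[positions]
    raise ValueError(f"Unknown method: {method!r}")


# A toy sequence with six rows.
sequence = pd.DataFrame({"step": [0, 1, 2, 3, 4, 5]})
first = clip_sequence(sequence, 3, "first_rows")
last = clip_sequence(sequence, 3, "last_rows")
rand = clip_sequence(sequence, 3, "random", seed=0)
```

With 'random', the rows that are kept vary with the seed, but their relative order always matches the original sequence.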
Output
A dataset with fewer rows than before. The dataset will continue to represent multiple sequences of potentially varying lengths.
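To make the expected output concrete, the following sketch mimics the described behavior with pandas alone. It is an approximation under stated assumptions, not the library's function: the name `subsample_sequences`, the `sequence_key` and `seed` arguments, and the `patient_id` column are hypothetical, and only the default 'first_rows' shortening is shown.

```python
import numpy as np
import pandas as pd


def subsample_sequences(data, sequence_key, num_sequences,
                        max_sequence_length=None, seed=None):
    """Randomly keep num_sequences sequences, optionally clipping each one.

    Hypothetical stand-in used for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Each distinct value of the sequence key identifies one sequence.
    keys = data[sequence_key].unique()
    chosen = rng.choice(keys, size=num_sequences, replace=False)
    subset = data[data[sequence_key].isin(chosen)]
    if max_sequence_length is not None:
        # Keep the first n rows of every sequence, in original row order.
        subset = subset.groupby(sequence_key, sort=False).head(max_sequence_length)
    return subset


# Toy multi-sequence data: three sequences of lengths 4, 2, and 5.
data = pd.DataFrame({
    "patient_id": ["A"] * 4 + ["B"] * 2 + ["C"] * 5,
    "heart_rate": [70, 72, 71, 69, 80, 82, 65, 66, 64, 67, 68],
})
subset = subsample_sequences(
    data, "patient_id", num_sequences=2, max_sequence_length=3, seed=0
)
```

The result has fewer rows than the input, still contains multiple sequences, and those sequences may differ in length (for example, a sequence that was already shorter than the limit is left untouched).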