# Cleaning Your Data

Use the utility functions below to clean your sequential data for fast and effective modeling.

### get_random_sequence_subset

Use this function to subsample data from your dataset. Given multi-sequence data, this function will randomly select sequences and clip them to the desired length.

**Parameters**

(required) `data`

: A pandas.DataFrame containing your multi-sequence data

(required) `metadata`

: A Metadata object that describes the data. The metadata must describe multi-sequence data, meaning that it must have a sequence key specified.

(required) `num_sequences`

: An int describing the number of sequences to subsample from the data

`max_sequence_length`

: The maximum length each sequence is allowed to be

(default) `None`

: Do not enforce any max length, meaning that entire sequences will appear in the subsampled data

`<integer>`

: An integer describing the max sequence length. Any sequence that is longer than this value will be shortened based on the method below

`long_sequence_subsampling_method`

: The method for shortening sequences that are too long

(default) `'first_rows'`

: Keep the first *n* rows of each sequence as they appear, where *n* is the max sequence length

`'last_rows'`

: Keep the last *n* rows of each sequence as they appear, where *n* is the max sequence length

`'random'`

: Randomly choose *n* rows of each sequence, where *n* is the max sequence length. Note that the randomly chosen rows will still appear in the same order as the original data.

**Output** A dataset with fewer rows than before. The dataset will continue to represent multiple sequences of potentially varying lengths.
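
To make the three shortening methods concrete, here is a minimal pure-Python sketch of the behavior described above. It is not the library's implementation: `shorten_sequence` and `sketch_sequence_subset` are hypothetical helper names, and sequences are represented as a plain dict of lists rather than a pandas.DataFrame with a Metadata object.

```python
import random


def shorten_sequence(rows, max_sequence_length=None, method="first_rows", rng=None):
    """Hypothetical helper: clip one sequence to at most max_sequence_length rows."""
    # None means no limit; short sequences are returned unchanged.
    if max_sequence_length is None or len(rows) <= max_sequence_length:
        return list(rows)

    n = max_sequence_length
    if method == "first_rows":
        return list(rows[:n])
    if method == "last_rows":
        return list(rows[len(rows) - n:])
    if method == "random":
        rng = rng or random.Random()
        # Sample row positions, then sort them so the original order is kept.
        kept = sorted(rng.sample(range(len(rows)), n))
        return [rows[i] for i in kept]
    raise ValueError(f"Unknown method: {method!r}")


def sketch_sequence_subset(sequences, num_sequences, max_sequence_length=None,
                           method="first_rows", seed=None):
    """Hypothetical helper: pick num_sequences sequences at random, then clip each."""
    rng = random.Random(seed)
    chosen = rng.sample(list(sequences), num_sequences)
    return {
        key: shorten_sequence(sequences[key], max_sequence_length, method, rng)
        for key in chosen
    }


# Toy data: sequence key -> rows of that sequence, in their original order.
sequences = {"A": [1, 2, 3, 4, 5], "B": [10, 20], "C": [7, 8, 9]}
subset = sketch_sequence_subset(
    sequences, num_sequences=2, max_sequence_length=3, method="last_rows", seed=0
)
```

In this sketch, `subset` holds two of the three sequences, each with at most three rows. A sequence that is already short enough (such as `"B"`) is kept whole, which is why the output can still contain sequences of varying lengths.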
