OptimizedTimestampEncoder

Compatibility: datetime data

The OptimizedTimestampEncoder transforms data that represents dates and times into numerical values. Each transformed value is a number that represents the original datetime. The representation is optimized to use as little memory as possible for your particular dataset, so only the transformer that produced it can interpret the values.

from rdt.transformers.datetime import OptimizedTimestampEncoder
transformer = OptimizedTimestampEncoder()

Parameters

missing_value_replacement: Add this argument to replace missing values during the transform phase

(default) 'random'

Replace missing values with a random value. The value is chosen uniformly at random from the min/max range.

'mean'

Replace all missing values with the average value.

'mode'

Replace all missing values with the most frequently occurring value.

None

Deprecated. Do not replace missing values. The transformed data will continue to have missing values.
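To picture what the 'mean' and 'mode' options do, here is a minimal standard-library sketch. It is not RDT's implementation: the helpers fill_missing and to_seconds and the sample values are hypothetical, and it assumes the datetimes have already been converted to numbers (here, seconds since the Unix epoch).

```python
from datetime import datetime, timezone
from statistics import mean, mode


def fill_missing(values, strategy):
    """Replace None entries with the 'mean' or 'mode' of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known) if strategy == 'mean' else mode(known)
    return [fill if v is None else v for v in values]


def to_seconds(year, month, day):
    """Hypothetical helper: a UTC date as seconds since the Unix epoch."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


jan_1 = to_seconds(2022, 1, 1)
jan_3 = to_seconds(2022, 1, 3)
column = [jan_1, jan_1, jan_3, None]

filled_mean = fill_missing(column, 'mean')  # last entry lies between jan_1 and jan_3
filled_mode = fill_missing(column, 'mode')  # last entry equals jan_1, the most frequent value
```

With 'mean' the filled value may be a datetime that never appears in the original column, while 'mode' always reuses an existing value.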

(deprecated) model_missing_values: Use the missing_value_generation parameter instead.

missing_value_generation: Add this argument to determine how to recreate missing values during the reverse transform phase

(default) 'random'

Randomly assign missing values in roughly the same proportion as the original data.

'from_column'

Create a new column to store whether the value should be missing. Use it to recreate missing values. Note: Adding extra columns uses more memory and increases the RDT processing time.

None

Do not recreate missing values.
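The idea behind 'from_column' can be sketched in plain Python. This is only an illustration under stated assumptions, not RDT's internals: the function names, the fill value and the sample column are hypothetical.

```python
def transform_with_flag(values, fill):
    """Return (filled values, is-missing flags) for a column with None entries."""
    flags = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags


def reverse_with_flag(filled, flags):
    """Put None back wherever the flag column marks a value as missing."""
    return [None if missing else v for v, missing in zip(filled, flags)]


column = [1640995200, None, 1641168000]
filled, flags = transform_with_flag(column, fill=1641081600)
restored = reverse_with_flag(filled, flags)
```

The extra flags list is what makes exact recovery possible, and it is also the source of the additional memory and processing cost noted above.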

enforce_min_max_values: Add this argument to allow the transformer to learn the min and max allowed values from the data.

(default) False

Do not learn any min or max values from the dataset. When reverse transforming the data, the values may be above or below what was originally present.

True

Learn the min and max values from the input data. When reverse transforming the data, any out-of-bounds values will be clipped to the min or max value.
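As a rough illustration of what clipping to learned bounds means, the sketch below uses only the standard library. The helpers learn_bounds and clip and the sample dates are hypothetical and do not reflect how RDT stores or applies the bounds.

```python
from datetime import datetime


def learn_bounds(values):
    """Record the smallest and largest datetime seen in the input data."""
    return min(values), max(values)


def clip(value, lower, upper):
    """Pull an out-of-bounds value back to the nearest bound."""
    return max(lower, min(value, upper))


observed = [datetime(2022, 1, 1), datetime(2022, 6, 15), datetime(2022, 12, 31)]
lower, upper = learn_bounds(observed)

too_late = clip(datetime(2023, 3, 1), lower, upper)   # becomes the learned max
too_early = clip(datetime(2021, 1, 1), lower, upper)  # becomes the learned min
in_range = clip(datetime(2022, 6, 15), lower, upper)  # left unchanged
```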

datetime_format: Add this argument to tell the transformer how to read your datetime column if it's in a specific format that isn't easy to identify.

(default) None

Automatically detect the format. The transformer is able to detect common formats such as "02/15/22", "15/02/22 22:30" and "02-15-2022 10:30PM".

<string>

Read the format according to the instructions in the <string>. For example, to represent a datetime like "Feb 15, 2022 10:23:45 AM", you can use the format string "%b %d, %Y %I:%M:%S %p". For more info, see Python's strftime documentation.
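Before passing a format string to the transformer, it can help to check it against a few sample values with Python's own datetime module. The sketch below is one way to do that; the helper matches_format is hypothetical, and it assumes an English locale so that "Feb" and "AM" are recognized.

```python
from datetime import datetime

FORMAT = "%b %d, %Y %I:%M:%S %p"


def matches_format(value, fmt):
    """Return True if `value` can be parsed with the format string `fmt`."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


parsed = datetime.strptime("Feb 15, 2022 10:23:45 AM", FORMAT)
ok = matches_format("Feb 15, 2022 10:23:45 AM", FORMAT)  # value follows the format
bad = matches_format("02/15/22", FORMAT)                 # value does not follow it
```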

Examples

from rdt.transformers.datetime import OptimizedTimestampEncoder

transformer = OptimizedTimestampEncoder(missing_value_replacement='mean',
                                        datetime_format='%b %d, %Y %I:%M:%S %p')

FAQs

When do I need to supply a format string?

The transformer should be able to automatically detect the most common datetime formats. If you are not sure whether your format can be detected, we recommend trying it without the format string first. If you see an error, supply the format.

Particular confusion might arise if your datetime values have uncommon formats. For example:

  • You do not have leading zeros in your months or days, such as "1/1/21" instead of "01/01/21".

  • You are using something other than hyphens, dashes or colons to separate the date and time components, such as "[Jan][1][2021][12:34]".
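For the two uncommon cases above, the format strings might look like the following. The sketch uses Python's datetime.strptime to check them; that the transformer interprets datetime_format with exactly the same rules is an assumption, and month names assume an English locale.

```python
from datetime import datetime

# No leading zeros: %m and %d also accept single-digit months and days when parsing.
no_padding = datetime.strptime("1/1/21", "%m/%d/%y")

# Unusual separators: literal characters such as brackets go straight into the format string.
bracketed = datetime.strptime("[Jan][1][2021][12:34]", "[%b][%d][%Y][%H:%M]")
```

If both values parse as expected, the same strings are reasonable candidates for the datetime_format parameter.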

Should I replace missing values?

The decision to replace missing values is based on how you plan to use your data. For example, you might be using RDT to clean your data for machine learning (ML). Check to see whether the ML techniques you plan to use allow missing values.

When is it necessary to model missing values?

When setting the missing_value_generation parameter (which replaces the deprecated model_missing_values), consider whether the "missingness" of the data is something important. For example, maybe the user opted out of supplying the info on purpose, or maybe a missing value is highly correlated with another column in your dataset. If "missingness" is something you want to account for, you should model missing values.
