OptimizedTimestampEncoder

Compatibility: datetime data

The OptimizedTimestampEncoder transforms data that represents dates and times into numerical values. The transformed value is a number that represents the datetime. It is optimized to take up the least memory based on your unique dataset, and can only be understood by the transformer.

from rdt.transformers.datetime import OptimizedTimestampEncoder
transformer = OptimizedTimestampEncoder()

Parameters

missing_value_replacement: Add this argument to replace missing values during the transform phase.

(default) 'mean'

Replace all missing values with the average value.

'random'

Replace missing values with a random value. The value is chosen uniformly at random from the min/max range.

'mode'

Replace all missing values with the most frequently occurring value.

None

Do not replace missing values. The transformed data will continue to have missing values.

(deprecated) model_missing_values: Use the missing_value_generation parameter instead.

missing_value_generation: Add this argument to determine how to recreate missing values during the reverse transform phase.

(default) 'random'

Randomly assign missing values in roughly the same proportion as the original data.

'from_column'

Create a new column to store whether the value should be missing. Use it to recreate missing values. Note: Adding extra columns uses more memory and increases the RDT processing time.

None

Do not recreate missing values.

enforce_min_max_values: Add this argument to allow the transformer to learn the min and max allowed values from the data.

(default) False

Do not learn any min or max values from the dataset. When reverse transforming the data, the values may be above or below what was originally present.

True

Learn the min and max values from the input data. When reverse transforming the data, any out-of-bounds values will be clipped to the min or max value.

datetime_format: Add this argument to tell the transformer how to read your datetime column if it's stored as a string.

(default) None

Format detection isn't needed. This may be because your data is already stored as pandas datetime objects. If your data is stored as strings, please provide a format.

<string>

Read the format according to the directives in the <string>. For example, to represent a datetime like "Feb 15, 2022 10:23:45 AM", you can use the format string "%b %d, %Y %I:%M:%S %p". For more information, see the documentation for Python's strftime format codes.

extract_timezone: Add this argument if your datetime column has timezone information, and you'd like to extract the timezone into a new column to consider as a separate feature.

(default) False

Do not extract the timezone. Your datetime values will be converted to numerical values based on their timezones, but the timezones themselves will not be extracted into a new column.

True

Extract the timezones into a new column. Your datetime values will be converted into numerical values based on their timezones, and the timezone values themselves will be extracted into a new column.

*SDV Enterprise Feature. The extract_timezone parameter is available to our licensed users and is not currently in our public library. For more information, visit our page to Explore SDV.

Examples

Basic case: This transformer parses your datetime format and converts each value into a numerical Unix time.

from rdt.transformers.datetime import OptimizedTimestampEncoder

transformer = OptimizedTimestampEncoder(missing_value_replacement='mean',
                                        datetime_format='%d %b %Y')
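
To make the idea of converting to Unix time and filling missing values with the mean more concrete, here is a minimal sketch that uses only Python's standard library, not the transformer itself. It assumes the strings are in UTC and measures Unix time in seconds. The helper names to_unix and fill_with_mean are hypothetical, and the transformer's actual numerical output may use a different scale.

```python
from datetime import datetime, timezone
from statistics import mean


def to_unix(values, datetime_format):
    """Parse each string as a UTC datetime and return Unix seconds.

    Missing entries (None) are passed through unchanged.
    """
    result = []
    for value in values:
        if value is None:
            result.append(None)
            continue
        parsed = datetime.strptime(value, datetime_format)
        result.append(parsed.replace(tzinfo=timezone.utc).timestamp())
    return result


def fill_with_mean(values):
    """Replace None with the average of the observed values.

    Assumes at least one value is present.
    """
    average = mean(v for v in values if v is not None)
    return [average if v is None else v for v in values]


raw = ['01 Jan 2021', None, '03 Jan 2021']
unix = to_unix(raw, '%d %b %Y')
filled = fill_with_mean(unix)
print(filled)
```

In this sketch the missing middle value is filled with the midpoint of the two observed dates, which corresponds to 02 Jan 2021.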

Converting based on timezones: If your data contains timezones, the transformer will consider the timezone during the conversion. For example, 3pm in New York is the same as 8pm in London. Both of these datetime values will be converted to the same Unix time.

How are timezones represented? This transformer is capable of understanding timezone offsets. These are typically represented as the difference (hours and minutes) from Coordinated Universal Time (UTC), which is considered +0000. For example, New York (US East Coast) is 5 hours behind UTC during standard time, so it's represented as -0500, and Delhi is 5 hours 30 minutes ahead of UTC, so it's represented as +0530. Be sure to represent the offset in your datetime_format string using the %z flag.

Instead of offsets, you may have timezone names such as EST or UTC. Timezone names are not globally standardized, so this transformer offers limited support for only the most common ones: UTC and GMT. Be sure to represent these in your datetime_format string using the %Z flag.

from rdt.transformers.datetime import OptimizedTimestampEncoder

transformer = OptimizedTimestampEncoder(datetime_format='%b %d, %Y %I:%M%p (%z)')

The transformer factors in the timezones when converting to Unix time. In this example, the first 2 rows represent the exact same time in different timezones: 3pm in New York (timezone UTC-0500) is the same as 8pm in London (timezone UTC+0000).
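
The equivalence described above can be checked with Python's standard library alone. This is a small sketch rather than the transformer's own code: it parses two sample strings (chosen here for illustration) with the same format string and compares their Unix times in seconds.

```python
from datetime import datetime

DATETIME_FORMAT = '%b %d, %Y %I:%M%p (%z)'

# The same instant, written with two different timezone offsets.
new_york = datetime.strptime('Jan 01, 2021 03:00PM (-0500)', DATETIME_FORMAT)
london = datetime.strptime('Jan 01, 2021 08:00PM (+0000)', DATETIME_FORMAT)

# Both values are timezone-aware, so timestamp() accounts for the offset.
new_york_unix = new_york.timestamp()
london_unix = london.timestamp()

print(new_york_unix == london_unix)
```

Because the %z directive produces timezone-aware values, both strings map to the same number even though their wall-clock times differ.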

Extracting timezone values: In addition to the Unix time conversion, SDV Enterprise users can extract the timezones into a new categorical column to consider as a separate feature. This is particularly useful if your data contains multiple timezones. When reverse transforming back to the original data, this allows you to preserve the same mix of timezones.

from rdt.transformers.datetime import OptimizedTimestampEncoder

transformer = OptimizedTimestampEncoder(
    datetime_format='%b %d, %Y %I:%M:%S %p (%z)',
    extract_timezone=True)

The transformer factors in the timezones when converting to Unix time. In this example, the first 2 rows represent the exact same time in different timezones: 3pm in New York (timezone UTC-0500) is the same as 8pm in London (timezone UTC+0000). Additionally, SDV Enterprise users have the option to save the original timezone value in a new column so it is not lost.
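
The idea of keeping the timezone in a separate column can be illustrated with a standard-library sketch. This is not the SDV Enterprise implementation: the helper names split_timezone and restore_timezone are hypothetical, and the sample rows are made up for illustration. Each string is split into a Unix time (in seconds) and its offset, and the pair is then used to rebuild the original string.

```python
from datetime import datetime

DATETIME_FORMAT = '%b %d, %Y %I:%M:%S %p (%z)'


def split_timezone(value):
    """Return (Unix seconds, offset string such as '-0500') for one value."""
    parsed = datetime.strptime(value, DATETIME_FORMAT)
    return parsed.timestamp(), parsed.strftime('%z')


def restore_timezone(unix_seconds, offset):
    """Rebuild the datetime string from Unix seconds and the saved offset."""
    tz = datetime.strptime(offset, '%z').tzinfo
    restored = datetime.fromtimestamp(unix_seconds, tz=tz)
    return restored.strftime(DATETIME_FORMAT)


rows = [
    'Jan 01, 2021 03:00:00 PM (-0500)',
    'Jan 01, 2021 08:00:00 PM (+0000)',
]
encoded = [split_timezone(row) for row in rows]
restored = [restore_timezone(unix, offset) for unix, offset in encoded]
print(encoded)
print(restored)
```

Both rows share the same Unix time, so without the saved offset they would be indistinguishable after the conversion. Keeping the offset alongside the number is what lets the original mix of timezones be recovered.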

For more information about timezones, see the FAQ.

FAQs

When do I need to supply a format string?

The transformer should be able to automatically detect the most common datetime formats. If you are not sure whether your format can be detected, we recommend trying it without the format string first. If you see an error, supply the format.

Particular confusion might arise if your datetime values have uncommon formats. For example:

  • You do not have leading 0's in your months or days, such as "1/1/21" instead of "01/01/21".

  • You are using something other than hyphens, slashes, or colons to separate the date and time components, such as "[Jan][1][2021][12:34]".
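
For uncommon formats like the two above, an explicit format string removes the ambiguity. The sketch below uses Python's own strptime to show format strings that match both examples. It assumes the transformer interprets datetime_format with the same directives; the month-first reading of "1/1/21" is also an assumption.

```python
from datetime import datetime

# No leading zeros: strptime accepts single-digit months and days.
no_leading_zeros = datetime.strptime('1/1/21', '%m/%d/%y')

# Unusual separators: literal characters in the format must match exactly.
bracketed = datetime.strptime('[Jan][1][2021][12:34]', '[%b][%d][%Y][%H:%M]')

print(no_leading_zeros)
print(bracketed)
```

A format string that parses your values here is a reasonable candidate to pass as the datetime_format parameter.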

Should I replace missing values?

The decision to replace missing values is based on how you plan to use your data. For example, you might be using RDT to clean your data for machine learning (ML). Check to see whether the ML techniques you plan to use allow missing values.

When is it necessary to model missing values?

When setting the missing_value_generation parameter, consider whether the "missingness" of the data is something important. For example, maybe the user opted out of supplying the info on purpose, or maybe a missing value is highly correlated with another column in your dataset. If "missingness" is something you want to account for, you should model missing values.
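
As a rough illustration of what modeling "missingness" with an extra column means, the sketch below stores a separate indicator next to the filled values and later uses it to put the missing values back. It is a simplified, hypothetical version of the idea in plain Python, not the transformer's implementation; the function names and the 0.5 threshold are assumptions made for this example.

```python
def split_missing(values, fill_value):
    """Return (filled values, indicator) where the indicator is 1.0 if missing."""
    filled = [fill_value if v is None else v for v in values]
    is_missing = [1.0 if v is None else 0.0 for v in values]
    return filled, is_missing


def restore_missing(filled, is_missing):
    """Put None back wherever the indicator says the value was missing."""
    return [None if flag >= 0.5 else v for v, flag in zip(filled, is_missing)]


original = [1609459200.0, None, 1609632000.0]
filled, is_missing = split_missing(original, fill_value=1609545600.0)
recovered = restore_missing(filled, is_missing)
print(is_missing)
print(recovered)
```

The indicator column carries exactly the information that would otherwise be lost once the missing value is replaced, at the cost of one extra column per transformed column.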

How does this transformer handle timezones?

This transformer is capable of understanding timezone offsets. These should be represented as the difference (hours and minutes) from Greenwich Mean Time, which is considered +0000. For example, New York (US East Coast) is 5 hours behind during standard time, so it's represented as -0500. Meanwhile, Delhi is 5 hours 30 minutes ahead, so it's represented as +0530. Be sure to represent the offset in your datetime_format string using the %z flag.

Instead of offsets, you may have timezone names such as EST or UTC. Timezone names are not globally standardized, so this transformer offers limited support for only the most common ones: UTC and GMT. Be sure to represent these in your datetime_format string using the %Z flag.

When transforming your data, the transformer will consider the timezone. For example, 3pm in New York is the same as 8pm in London. Both of these datetime values will be converted to the same Unix time. When reverse transforming back to the original data, you'll see a single consistent timezone (your original one or +0000).

SDV Enterprise users have additional features. As an SDV Enterprise user, you'll be able to use the extract_timezone parameter, which adds an extra categorical column with the original timezone information. This is particularly useful if your data contains multiple timezones. When reverse transforming back to the original data, you'll see the same mix of timezones as the original data.
