RegexGenerator

Compatibility: text data

The RegexGenerator is used to create structured text. When transforming the data, it simply removes the column. When reversing the transform, it recreates the structured text in the column through a regex string.

from rdt.transformers.text import RegexGenerator

transformer = RegexGenerator()

You can specify the exact regex string to use for more realistic data.

Parameters

regex_format: A string that represents a Regular Expression. This expression will be used to generate new data.

(default) '[A-Za-z]{5}'

Generate 5-character strings such as 'ABCDE'.

<string>

Use the specified regex string to generate new values.

enforce_uniqueness: Whether to guarantee that the created fake data will be unique.

(default) False

The generated data may contain repeating values.

True

The generated data will not contain any repeating values.

generation_order: The order in which to generate values from the regex (during the reverse transform).

(default) 'alphanumeric'

Generate the data sequentially, in alphanumeric order. For example: 'aaa', 'aab', 'aac', etc.

'scrambled'

Generate the data sequentially but then scramble it before returning the results. For large batches of data, this is an effective way to give the appearance of randomness.
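The difference between the two orders can be illustrated with a minimal standard-library sketch. It does not use RDT and is not the transformer's actual implementation; the helper names `sequential_values` and `scrambled_values` are hypothetical, and the sketch assumes a fixed-length format over a single alphabet, such as '[a-z]{3}'.

```python
import itertools
import random
import string


def sequential_values(alphabet, length, count):
    """Return the first `count` strings of `length` characters in sorted order."""
    combos = itertools.product(sorted(alphabet), repeat=length)
    return ["".join(chars) for chars in itertools.islice(combos, count)]


def scrambled_values(alphabet, length, count, seed=None):
    """Generate the same batch sequentially, then shuffle it before returning."""
    values = sequential_values(alphabet, length, count)
    random.Random(seed).shuffle(values)
    return values


# Roughly what 'alphanumeric' order looks like for the format '[a-z]{3}'
ordered = sequential_values(string.ascii_lowercase, 3, 5)
print(ordered)  # ['aaa', 'aab', 'aac', 'aad', 'aae']

# Roughly what 'scrambled' order looks like: same values, shuffled
shuffled = scrambled_values(string.ascii_lowercase, 3, 5, seed=0)
print(shuffled)
```

Note that in this sketch the scrambled batch contains exactly the same values as the sequential one; only the row order changes.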

Examples

from rdt.transformers.text import RegexGenerator

# generate values that follow the format 'ID_' followed by a 3-digit number
rg = RegexGenerator(
    regex_format=r'ID_\d{3}',  # raw string so that \d is not treated as a string escape
    enforce_uniqueness=True
)
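When uniqueness is enforced, it helps to know how many distinct values a format can produce. The sketch below uses only the standard library (not RDT) and assumes that \d stands for the ASCII digits 0-9. It enumerates every value of the format above to show that only 1000 distinct IDs exist, so a longer column cannot be filled from this format without repeats.

```python
import itertools
import re
import string

# The prefix 'ID_' is fixed; \d{3} contributes three digit positions.
all_ids = ["ID_" + "".join(digits) for digits in itertools.product(string.digits, repeat=3)]

print(len(all_ids))             # 1000
print(all_ids[0], all_ids[-1])  # ID_000 ID_999

# Every enumerated value matches the regex format.
pattern = re.compile(r'ID_\d{3}')
all_match = all(pattern.fullmatch(value) for value in all_ids)
print(all_match)  # True
```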

FAQs

Are all regexes supported?

The RegexGenerator does not currently support regexes with sub-patterns, which are frequently used to indicate an "or" logic. For example, the regex string '([A-Z]{2}|\d{4})' is intended to match a 2-character string such as 'DB' or a 4-digit string such as '0391'. This regex is not suitable for the RegexGenerator.
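To make the "or" logic concrete, the following sketch uses Python's built-in `re` module (not RDT) to show what this sub-pattern regex is meant to match. It only illustrates the regex itself; it says nothing about how the RegexGenerator handles such a pattern internally.

```python
import re

# A sub-pattern with "or" logic: two uppercase letters OR four digits
pattern = re.compile(r'([A-Z]{2}|\d{4})')

candidates = ['DB', '0391', 'D1', '039']
results = {value: bool(pattern.fullmatch(value)) for value in candidates}

for value, matched in results.items():
    print(value, matched)
# DB True
# 0391 True
# D1 False
# 039 False
```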

Tip: If you are trying to express a basic index column with countable integers (0, 1, 2, ...), we recommend using the IDGenerator instead of this transformer. The IDGenerator also allows you to input a prefix and suffix to the index.

When should I use this transformer?

The RegexGenerator is useful for text columns that do not have any mathematical meaning. This transformer follows the regex format to generate values, which may be exactly the same as the real data depending on the exact format string.

This transformer is useful for columns that represent structured IDs, such as a primary key column.
