RegexGenerator
Compatibility: id data
The RegexGenerator is used to create structured text. When transforming the data, it simply removes the column. When reversing the transform, it recreates the column by generating structured text from a regex string.

from rdt.transformers.text import RegexGenerator
transformer = RegexGenerator()
You can specify the exact regex string to use for more realistic data.
Parameters
regex_format: A string that represents a regular expression. This expression will be used to generate new data.
(default) '[A-Za-z]{5}'
Generate 5-character strings such as 'ABCDE'.
<string>
Use the specified regex string to generate new values.
cardinality_rule: How many unique values to create in the fake data.
(default) None
Do not impose any rules. Any number of Regex values can be generated.
'unique'
The generated data should not contain any repeating values. Note: This option may limit the amount of data that you can create using the regex.
'match'
Learn the number of unique values from the fit data and ensure that the generated data contains the same number. These may be repeated.
'scale'
Learn the number of unique values from the fit data and scale it proportionally when generating data. For example, if there are 25 unique values for every 100 rows of data, the transformer will create 50 unique values when generating 200 rows.
(deprecated) enforce_uniqueness: Use the cardinality_rule parameter instead.
generation_order: The order in which to generate the regex values (during the reverse transform).
(default) 'alphanumeric'
Generate the data sequentially, in alphanumeric order. For example: 'aaa', 'aab', 'aac', etc.
'scrambled'
Generate the data sequentially, but then scramble it before returning the results. For large batches of data, this is an effective way to make the values appear random.
'random'
Generate data completely randomly. This method works even for small batches of data.
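The three generation orders described above can be pictured with a small standard-library sketch. This is not RDT's implementation: the helpers enumerate_values and generate are hypothetical, they handle only a fixed-length pattern over a given alphabet (standing in for a regex such as '[a-c]{3}'), and they assume the number of rows does not exceed the number of possible values.

```python
import itertools
import random


def enumerate_values(alphabet, length):
    """Hypothetical helper: every string of `length` characters drawn
    from `alphabet`, listed in alphanumeric order."""
    return [
        ''.join(chars)
        for chars in itertools.product(sorted(alphabet), repeat=length)
    ]


def generate(alphabet, length, num_rows, order='alphanumeric', seed=0):
    """Illustrative stand-in for the three generation orders."""
    values = enumerate_values(alphabet, length)
    rng = random.Random(seed)  # seeded only to make the sketch repeatable

    if order == 'alphanumeric':
        # Take the first num_rows values in sequence.
        return values[:num_rows]
    if order == 'scrambled':
        # Same sequential values, shuffled before they are returned.
        batch = values[:num_rows]
        rng.shuffle(batch)
        return batch
    if order == 'random':
        # Each row is drawn independently from all possible values.
        return [rng.choice(values) for _ in range(num_rows)]
    raise ValueError(f'Unknown order: {order!r}')


print(generate('abc', 3, 3))  # ['aaa', 'aab', 'aac']
print(generate('abc', 3, 3, order='scrambled'))
print(generate('abc', 3, 3, order='random'))
```

In this sketch a scrambled batch of 3 rows still contains only 'aaa', 'aab' and 'aac', which is why scrambling looks random mainly for large batches, while the random order can draw from the full range even for a small batch.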
Examples
from rdt.transformers.text import RegexGenerator

# generate values that follow the format 'ID_' followed by a 3-digit number
rg = RegexGenerator(
    regex_format=r'ID_\d{3}',
    # cardinality_rule replaces the deprecated enforce_uniqueness=True
    cardinality_rule='unique'
)
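To see why the 'unique' rule can limit how much data you can create, and how the 'scale' rule sizes its pool of values, it may help to work through the arithmetic for the 'ID_' pattern above. The following is a plain-Python sketch that does not use RDT; both function names are hypothetical, and rounding to the nearest integer is an assumption of the sketch.

```python
def max_unique_ids(num_digits):
    """Count the distinct strings of the form 'ID_' plus `num_digits` digits."""
    return 10 ** num_digits


def scaled_unique_count(fit_unique, fit_rows, new_rows):
    """Scale the unique-value count seen in the fit data to a new row count.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(fit_unique * new_rows / fit_rows)


# 'ID_' followed by 3 digits can produce at most 1000 distinct values,
# so a request for more than 1000 non-repeating rows cannot be satisfied.
print(max_unique_ids(3))  # 1000

# 25 unique values per 100 fitted rows scales to 50 for 200 generated rows.
print(scaled_unique_count(25, 100, 200))  # 50
```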
