* DomainBasedAnonymizer

*SDV Enterprise Feature. This feature is available to our licensed users and is not currently in our public library. To learn more about SDV Enterprise and its extra features, visit our website.

The DomainBasedAnonymizer performs Contextual Anonymization on email data. It transforms emails by extracting their domains. When reversing the transform, it generates new, fake emails with the correct domains.

from rdt.transformers.email import DomainBasedAnonymizer

transformer = DomainBasedAnonymizer(obfuscate_emails=True)

Parameters

extracted_domain: Which parts of the overall email domain to extract during the transformation phase

(default) 'full'

Extract the full domain, which is everything after the @ sign. For example if the email is 'info@datacebo.com', the full domain is 'datacebo.com'.

'top'

Extract only the top domain, which is everything after the . character. For example if the email is 'info@datacebo.com', the top domain is 'com'.
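The difference between the two options can be sketched in plain Python. This is only an illustration, not the transformer's actual implementation: the helper name extract_domain is hypothetical, it assumes every value is a well-formed email containing an @ sign, and it assumes that 'top' means the part after the final . character.

```python
def extract_domain(email, extracted_domain='full'):
    """Hypothetical helper: return the full or top domain of an email string."""
    # The full domain is everything after the @ sign.
    full_domain = email.rsplit('@', 1)[1]
    if extracted_domain == 'full':
        return full_domain
    if extracted_domain == 'top':
        # Assumption: the top domain is the part after the final '.' character.
        return full_domain.rsplit('.', 1)[1]
    raise ValueError("extracted_domain must be 'full' or 'top'")


print(extract_domain('info@datacebo.com', 'full'))  # datacebo.com
print(extract_domain('info@datacebo.com', 'top'))   # com
```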

enforce_unique_count: Limit the number of new emails created to the number originally found in the dataset.

(default) False

Create a variety of new emails based on the domain.

True

Put a limit on the number of new emails created. Emails will be recycled after the limit is reached.

Setting this to True will leak information about the number of unique emails within each domain. However, these emails will be newly created ones that may not appear in the original data. Always evaluate the risk of a data leak before sharing your transformed data.
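To make the recycling behavior concrete, here is a minimal sketch of what a per-domain limit could look like. It is an assumption-laden illustration rather than the library's code: the helper make_fake_emails and its parameters are hypothetical, usernames are random strings, and "recycling" is modeled as cycling through a fixed pool in order, which may differ from how the transformer actually reuses emails.

```python
import itertools
import random
import string


def make_fake_emails(domain, num_rows, unique_limit=None, seed=0):
    """Hypothetical helper: create num_rows fake emails for one domain.

    If unique_limit is given, at most that many distinct emails are created,
    and they are reused (recycled) once the limit is reached.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits

    def new_email():
        # Random 10-character username, similar in spirit to obfuscated emails.
        username = ''.join(rng.choices(alphabet, k=10))
        return f'{username}@{domain}'

    if unique_limit is None:
        # No limit: every row gets a freshly generated email.
        return [new_email() for _ in range(num_rows)]

    # With a limit: build a fixed pool, then cycle through it for all rows.
    pool = [new_email() for _ in range(unique_limit)]
    return list(itertools.islice(itertools.cycle(pool), num_rows))


emails = make_fake_emails('datacebo.com', num_rows=6, unique_limit=2)
print(emails)
```

In this sketch, six rows are filled from a pool of only two emails, so anyone inspecting the output can infer that the original domain had about two unique addresses. That is the information leak described above.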

obfuscate_emails: Control whether the overall email looks realistic or follows random patterns.

(default) False

Create realistic-looking usernames and emails such as 'johndoe@gmail.com'.

True

Obfuscate the usernames and emails to create random values such as 'dkep22ocp2@sdv-example.com'.

Setting this to False may result in emails that correspond to real user emails by complete coincidence. If you are worried about creating emails that accidentally correspond to real users, please set this to True.

Examples

from rdt.transformers.email import DomainBasedAnonymizer

transformer = DomainBasedAnonymizer(
    extracted_domain='top',
    enforce_unique_count=False,
    obfuscate_emails=True
)

Attributes

After fitting the transformer, you can access the learned values through its attributes.

domain_to_unique_count: The number of unique email addresses that belong to each domain in the original data.

>>> transformer.domain_to_unique_count
{
    'datacebo.com': 15,
    'gmail.com': 103,
    'yahoo.com': 10,
    'sdv.dev': 14
}

Note: If enforce_unique_count is False, the transformer does not compute these values. If it is True, you'll see one count per domain, keyed by the top or full domain depending on the extracted_domain setting you chose.
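As a rough illustration of what these counts represent, the sketch below computes a comparable dictionary from a list of emails using only the standard library. The helper count_unique_emails_per_domain is hypothetical and is not part of RDT. It assumes well-formed emails, groups by the full domain, and treats emails as case-sensitive strings.

```python
from collections import defaultdict


def count_unique_emails_per_domain(emails):
    """Hypothetical helper: map each full domain to its number of distinct emails."""
    unique_by_domain = defaultdict(set)
    for email in emails:
        # The full domain is everything after the @ sign.
        domain = email.rsplit('@', 1)[1]
        unique_by_domain[domain].add(email)
    return {domain: len(addresses) for domain, addresses in unique_by_domain.items()}


emails = [
    'info@datacebo.com',
    'info@datacebo.com',   # duplicate, counted once
    'sales@datacebo.com',
    'hello@sdv.dev',
]
print(count_unique_emails_per_domain(emails))
# {'datacebo.com': 2, 'sdv.dev': 1}
```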
