* DomainBasedAnonymizer
*SDV Enterprise Feature. This feature is available to our licensed users and is not currently in our public library. To learn more about SDV Enterprise and its extra features, get in touch with us.
The DomainBasedAnonymizer performs Contextual Anonymization on email data. It transforms emails by extracting their domains. When reversing the transform, it generates new, fake emails with the correct domains.
from rdt.transformers.email import DomainBasedAnonymizer
transformer = DomainBasedAnonymizer(obfuscate_emails=True)
extracted_domain
: Which parts of the overall email domain to extract during the transformation phase.
(default) 'full' | Extract the full domain, which is everything after the @ sign. For example, if the email is '[email protected]', the full domain is 'datacebo.com'. |
'top' | Extract only the top domain, which is everything after the . character. For example, if the email is '[email protected]', the top domain is 'com'. |
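To make the two options concrete, here is a minimal sketch in plain Python. The helper `extract_domain` and the sample address are hypothetical and written only for illustration; this is not the transformer's actual implementation, and it assumes the top domain is the text after the last `.` character.

```python
def extract_domain(email, extracted_domain='full'):
    """Illustrative helper (not part of RDT): return the domain part of an email."""
    # Everything after the last @ sign is treated as the full domain.
    full_domain = email.rsplit('@', 1)[1]
    if extracted_domain == 'full':
        return full_domain
    if extracted_domain == 'top':
        # Assumption: the top domain is the text after the last '.' character.
        return full_domain.rsplit('.', 1)[-1]
    raise ValueError("extracted_domain must be 'full' or 'top'")


# 'user@datacebo.com' is an invented sample address.
full = extract_domain('user@datacebo.com', 'full')  # 'datacebo.com'
top = extract_domain('user@datacebo.com', 'top')    # 'com'
```

The 'top' setting keeps less information about the original address, which may matter when many rows share a small, identifiable company domain.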
enforce_unique_count
: Limit the number of new emails created to the number originally found in the dataset.
(default) False | Create a variety of new emails based on the domain. |
True | Limit the number of new emails created. Emails will be recycled after the limit is reached. |
Setting this to True will leak information about the number of unique emails within each domain. However, these emails will be newly created ones that may not appear in the original data. Always evaluate the risk of a data leak before sharing your transformed data.
obfuscate_emails
: Control whether the overall email looks realistic or follows random patterns.
(default) False | Create realistic-looking usernames and emails such as '[email protected]'. |
True | Obfuscate the usernames and emails to create random values such as '[email protected]'. |
Setting this to False may result in emails that correspond to real user emails by complete coincidence. If you are worried about creating emails that accidentally correspond to real users, please set this to True.
from rdt.transformers.email import DomainBasedAnonymizer
transformer = DomainBasedAnonymizer(
    extracted_domain='top',
    enforce_unique_count=False,
    obfuscate_emails=True
)
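The effect of enforce_unique_count and obfuscate_emails on the reverse step can be sketched in plain Python. Everything below is an assumption made for illustration: the function names, the name pool, and the username formats are hypothetical, and the real transformer may generate emails quite differently.

```python
import itertools
import random
import string

# Hypothetical name pool, used only for this illustration.
SAMPLE_NAMES = ['alex', 'jordan', 'sam', 'taylor']


def make_username(rng, obfuscate_emails):
    """Return a random-pattern username or a name-like one."""
    if obfuscate_emails:
        # Eight random lowercase letters and digits.
        alphabet = string.ascii_lowercase + string.digits
        return ''.join(rng.choice(alphabet) for _ in range(8))
    # Name-like username: a sample name followed by a number.
    return f'{rng.choice(SAMPLE_NAMES)}{rng.randrange(10000)}'


def generate_emails(domain, n_rows, unique_count=None,
                    obfuscate_emails=False, seed=0):
    """Create n_rows fake emails for one domain.

    When unique_count is given, only that many distinct emails are created
    and they are recycled in order once the limit is reached.
    """
    rng = random.Random(seed)
    if unique_count is None:
        # No limit: draw a fresh email for every row.
        return [f'{make_username(rng, obfuscate_emails)}@{domain}'
                for _ in range(n_rows)]
    if unique_count < 1:
        raise ValueError('unique_count must be at least 1')
    # Assumes unique_count is far below the number of possible usernames.
    pool = []
    while len(pool) < unique_count:
        candidate = f'{make_username(rng, obfuscate_emails)}@{domain}'
        if candidate not in pool:
            pool.append(candidate)
    # Recycle the pool until n_rows emails have been produced.
    return list(itertools.islice(itertools.cycle(pool), n_rows))


emails = generate_emails('datacebo.com', n_rows=5, unique_count=2,
                         obfuscate_emails=True)
```

With a limit of 2 and 5 rows, the two generated emails alternate, so the output reveals that the domain had 2 unique emails. This is the leak described above.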
After fitting the transformer, you can access the learned values through its attributes.
domain_to_unique_count
: The number of unique email addresses that belong to every domain of the original data.
>>> transformer.domain_to_unique_count
{
    'datacebo.com': 15,
    'gmail.com': 103,
    'yahoo.com': 10,
    'sdv.dev': 14
}
Note: If enforce_unique_count is not set to True, the transformer does not compute these values. If it is, you will see the count per domain, keyed by the top or full domain depending on the extracted_domain setting.
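As a rough illustration of what such a mapping contains, the sketch below counts distinct emails per domain using only the standard library. The function `count_unique_emails_per_domain` and the sample addresses are hypothetical; the transformer's own computation is not shown in this page and may differ.

```python
from collections import defaultdict


def count_unique_emails_per_domain(emails, extracted_domain='full'):
    """Illustrative helper (not part of RDT): count distinct emails per domain."""
    unique = defaultdict(set)
    for email in emails:
        # Full domain: everything after the last @ sign.
        domain = email.rsplit('@', 1)[1]
        if extracted_domain == 'top':
            # Assumption: the top domain is the text after the last '.'.
            domain = domain.rsplit('.', 1)[-1]
        # A set ignores repeated occurrences of the same email.
        unique[domain].add(email)
    return {domain: len(values) for domain, values in unique.items()}


# Invented sample data; the repeated address is counted once.
sample = ['a@datacebo.com', 'b@datacebo.com', 'a@datacebo.com', 'c@sdv.dev']
full_counts = count_unique_emails_per_domain(sample)
top_counts = count_unique_emails_per_domain(sample, 'top')
```

Here `full_counts` is keyed by full domains and `top_counts` by top domains, mirroring how the keys of the attribute depend on extracted_domain.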