PseudoAnonymizedFaker

Last updated 11 months ago

Compatibility: pii data

The PseudoAnonymizedFaker pseudo-anonymizes private or sensitive data. When transforming the column, it converts the original data to numerical values. When reversing the transform, it pseudo-anonymizes the column by mapping each value to completely new, fake data using the Python Faker library. Note that the mapping is consistent, so the real, sensitive values can be recovered.

from rdt.transformers.pii import PseudoAnonymizedFaker

transformer = PseudoAnonymizedFaker()

You can specify the exact faker method to use for more realistic data.

Parameters

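provider_name: The name of the provider to use from the Faker library.

(default) None

Use the BaseProvider from Faker, which is capable of creating random text.

<string>

Use the provider for a specific context, for example "address" or "bank".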

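function_name: The name of the function to use within the Faker provider.

(default) 'lexify'

Use the lexify method to create random 4-character text.

<string>

Use the function from the specified provider to generate fake data. For example, "street_address" from the address provider or "swift" from the bank provider.

Together, the provider_name and function_name parameters specify exactly how to create fake data. Some common values are:

  • A full address: provider_name="address", function_name="address"
  • A basic bank account number: provider_name="bank", function_name="bban"
  • A full credit card number: provider_name="credit_card", function_name="credit_card_number"
  • Latitude/longitude coordinates: provider_name="geo", function_name="local_latlng"
  • A phone number: provider_name="phone_number", function_name="phone_number"

To browse for more options, visit the Faker library's docs.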

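function_kwargs: Optional parameters to pass into the function that you're specifying to create fake data.

(default) None

Do not specify any additional parameters.

<dictionary>

Additional parameters to add. These are unique to the function name and should be represented as a dictionary. For example, for the banking function "swift", you can specify: {"length": 11, "primary": True}.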

locales: An optional list of locales to use when generating the fake data.

(default) None

Use the default locale, which is usually set to the country you are in.

<list>

Create data from the list of locales. These are specified as strings representing the language and country from Faker. For example ["en_US", "fr_CA"].

Setting a locale might leak information about the original data. Anyone with access to the anonymized data will be able to tell which countries and locales are included in the original data.

Examples

from rdt.transformers.pii import PseudoAnonymizedFaker

# create more realistic-looking data by specifying a provider and function
transformer = PseudoAnonymizedFaker(
    provider_name="person",
    function_name="name")

FAQs

When should I use this transformer?

Use the PseudoAnonymizedFaker whenever you have sensitive data that should not be part of your data science project. By default, the transformer reverses the transform into fake, 4-character strings such as "UaNJ" in place of the original, sensitive data.

Use this transformer as-is if the values in your sensitive data do not matter. Alternatively, supply a provider and function name to create fake data that looks more realistic.
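To give a feel for what the default output looks like, here is a minimal sketch of a lexify-style generator written with the standard library only. The helper name lexify_like and the "????" template are assumptions made for illustration; this is a stand-in, not Faker's or RDT's actual implementation.

```python
import random
import string


def lexify_like(template="????", rng=None):
    """Replace each '?' in the template with a random ASCII letter (hypothetical helper)."""
    rng = rng or random.Random()
    return "".join(
        rng.choice(string.ascii_letters) if ch == "?" else ch
        for ch in template
    )


# The seed is only there to make the sketch repeatable.
rng = random.Random(0)
samples = [lexify_like(rng=rng) for _ in range(3)]
print(samples)  # three 4-letter strings, similar in shape to "UaNJ"
```

The strings carry no information about the original values, which is why the default is suitable when the content of the sensitive column does not matter.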

Will any of the real values show up in the fake data?

The PseudoAnonymizedFaker generates data randomly without looking at the real values. So there is a small chance that a real value may show up in the fake data by complete coincidence. For example, if your real data had a phone number "(617)123-4567", there's a small probability that the exact same phone number will be created by random chance. It may or may not map to a different number.

This behavior actually protects your sensitive data! Otherwise, anyone with access to the fake data would be able to deduce the real values by noting down what's missing.

What is the difference between the AnonymizedFaker and the PseudoAnonymizedFaker?

Pseudo-anonymization indicates that the scheme can be reversed while anonymization indicates that it's permanent.

This transformer pseudo-anonymizes data in a reversible way using consistent mapping between the original and fake data. This behavior allows you to add protection to your original data while also providing the option to recover the sensitive values. Note that anyone with access to the transformer will be able to look up the mapping and uncover the real values. If you want to prevent the possibility of recovering sensitive values, use the AnonymizedFaker instead.
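The distinction can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code of either transformer: fake_value is a hypothetical stand-in for a Faker call, and the real transformers may differ in details.

```python
import uuid


def fake_value():
    """Hypothetical stand-in for a Faker call: returns a random 8-character token."""
    return uuid.uuid4().hex[:8]


real = ["Ada Lovelace", "Grace Hopper", "Ada Lovelace"]

# Pseudo-anonymization: remember one fake value per distinct real value.
mapping = {}
pseudo = []
for value in real:
    if value not in mapping:
        mapping[value] = fake_value()
    pseudo.append(mapping[value])

# Anonymization: draw a fresh fake value per row and keep no record of it.
anonymous = [fake_value() for _ in real]

print(pseudo[0] == pseudo[2])  # True: the same real value always maps to the same fake value
print(len(mapping))            # 2: one stored entry per distinct real value
```

The stored mapping is what makes the first scheme reversible. The second scheme keeps nothing that links a fake value back to a real one.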

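Can I create and use my own custom Faker providers with PseudoAnonymizedFaker?

At this time, PseudoAnonymizedFaker doesn't explicitly support custom Faker functions that you've created yourself. You can use any of the standard providers in Faker.

Can I access the mapping that is used?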

Yes. This transformer has a get_mapping() function that will return the mapping.

If you are using a HyperTransformer, access the transformer through the config.

# assume ht is a HyperTransformer that is already fit
paf = ht.get_config()['transformers']['person']
paf.get_mapping()
{
    'Tim Berners-Lee': 'Amber Lee',
    'Ada Lovelace': 'Christopher Brown',
    'Grace Hopper': 'Amanda Jackson'
}
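Since the mapping goes from real values to fake values, recovering the originals means inverting it. A minimal sketch, using the example mapping above as a plain dictionary; the helper name recover is hypothetical, and the inversion assumes the fake values are unique.

```python
# Mapping in the shape shown above: real value -> fake value.
mapping = {
    'Tim Berners-Lee': 'Amber Lee',
    'Ada Lovelace': 'Christopher Brown',
    'Grace Hopper': 'Amanda Jackson',
}

# Invert it to look up the real value behind a fake one.
# If two real values shared a fake value, the later entry would overwrite the earlier one.
fake_to_real = {fake: real for real, fake in mapping.items()}


def recover(fake_values):
    """Return the real value for each fake value, or None if it is not in the mapping."""
    return [fake_to_real.get(value) for value in fake_values]


print(recover(['Amanda Jackson', 'Amber Lee']))
# ['Grace Hopper', 'Tim Berners-Lee']
```

This is also why access to the fitted transformer should be restricted: the mapping is all that is needed to undo the pseudo-anonymization.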
