Preprocessing

All synthesizers pre-process your data to prepare it for machine learning, and then post-process the synthetic data to convert it back into the original format. These advanced features allow you to control the transformations that are applied to each column.

Do you have any sensitive data? The transformations are also used for anonymizing or pseudo-anonymizing sensitive data.
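
The snippets on this page refer to a synthesizer and data that have already been created. Below is a minimal setup sketch, assuming the fake_hotels demo dataset and the HMASynthesizer (any multi-table synthesizer follows the same workflow):

from sdv.datasets.demo import download_demo
from sdv.multi_table import HMASynthesizer

# load demo data: a dictionary mapping each table name to a pandas DataFrame,
# plus the metadata describing the tables and their relationships
real_data, metadata = download_demo(
    modality='multi_table',
    dataset_name='fake_hotels'
)

# the synthesizer used in the snippets below; they pass the data
# as `data` or `real_data`
synthesizer = HMASynthesizer(metadata)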

Viewing the Transformations

The SDV will pre- and post-process your data by using reversible data transformations for each column. The transformers are available in our RDT library.

The methods below allow you to see which transformations will be applied to each column based on the synthesizer and data you plan to use.

auto_assign_transformers

Let the synthesizer automatically assign the transformations based on the data you'd like to use for modeling.

Parameters

  • (required) data: A dictionary that maps the name of each table to a pandas DataFrame object containing the real data that the machine learning model will learn from

Output (None)

synthesizer.auto_assign_transformers(data)

get_transformers

After assigning the transformers, you can get more details about the transformers that will be used on each column.

Parameters

  • (required) table_name: A string with the table name you'd like to see the transformers for

Output A dictionary mapping each column name to an RDT transformer that will pre- and post-process your data.

synthesizer.get_transformers(
    table_name='guests'
)
{
 'guest_email': AnonymizedFaker(provider_name='internet', function_name='email', enforce_uniqueness=True),
 'hotel_id': None,
 'has_rewards': LabelEncoder(add_noise=True),
 'room_type': None,
 'amenities_fee': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True),
 'checkin_date': UnixTimestampEncoder(datetime_format='%d %b %Y'),
 'checkout_date': UnixTimestampEncoder(datetime_format='%d %b %Y'),
 'room_rate': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True),
 'billing_address': AnonymizedFaker(provider_name='address', function_name='address'),
 'credit_card_number': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number', enforce_uniqueness=True)
}

Modifying the Transformers

After assigning transformers, you can also modify them to customize the pre- and post-processing.

update_transformers

Use this method to reassign new transformers that you'd like to use for specific columns. You can apply any transformer in the RDT Library to your data.

Parameters

  • (required) table_name: A string representing the table name you'd like to update the transformers for

  • (required) column_name_to_transformer: A dictionary mapping the column name to the new RDT transformer object that you'd like to use for pre- and post-processing.

Output (None)

from rdt.transformers import FloatFormatter

synthesizer.update_transformers(
    table_name='guests',
    column_name_to_transformer={
        'amenities_fee': FloatFormatter(missing_value_replacement=0.0)
    }
)

Be careful with this step! Make sure you are updating transformers that are compatible with each column's sdtype and make sense for the underlying machine learning model.
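
If you're unsure, you can view the currently assigned transformer before replacing it, for example:

# inspect the transformer currently assigned to the column you plan to change
current_transformers = synthesizer.get_transformers(table_name='guests')
print(current_transformers['amenities_fee'])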

* update_transformers_by_sdtype

Use this method to reassign new transformers for all columns of a specific sdtype. For example, reassign the processing for all numerical or categorical columns. You can apply any transformer in the RDT Library to your data.

Parameters

  • (required) sdtype: A string with the name of the sdtype that you want to change the transformers for

  • (required) transformer_name: A string with the name of the transformer to use.

  • transformer_parameters: A dictionary that maps the name of the transformer parameter (string) to the parameter value. Use this if you want to override the default settings.

    • (default) None: Do not override the default settings

  • table_names: A list of strings describing the specific table names to apply the transformers to

    • (default) None: Update the transformers for all tables

    • <list>: Only update the transformers for the given table names

synthesizer.update_transformers_by_sdtype(
    sdtype='numerical',
    transformer_name='FloatFormatter',
    transformer_parameters={
        'missing_value_replacement': 'random',
        'missing_value_generation': 'from_column',
    },
    table_names=['guests'])

Anonymizing Sensitive Data

Use the AnonymizedFaker or PseudoAnonymizedFaker for sensitive data.

from rdt.transformers import PseudoAnonymizedFaker

synthesizer.update_transformers(
    table_name='guests',
    column_name_to_transformer={
        'credit_card_number': PseudoAnonymizedFaker(provider_name='credit_card', function_name='credit_card_number')
    }
)

Anonymization vs. Pseudo-anonymization

Pseudo-anonymization preserves a mapping between the real, sensitive values and the fake, synthetic data. Use this if you'd like to trace back the synthetic data to real values.

Anonymization is not reversible. Anyone with access to the fake, synthetic data will not be able to trace it back to any real values.
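
For example, to fully anonymize the credit card numbers instead, with no mapping kept, you could assign an AnonymizedFaker; a sketch using the same guests table:

from rdt.transformers import AnonymizedFaker

# irreversible anonymization: fake credit card numbers with no mapping
# back to the real values
synthesizer.update_transformers(
    table_name='guests',
    column_name_to_transformer={
        'credit_card_number': AnonymizedFaker(provider_name='credit_card', function_name='credit_card_number')
    }
)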

Applying the Transformations

After modifying the transformations, you can apply them to the real data in a step-wise fashion or all at once.

preprocess

Use this function to preprocess the data according to the transformations. After preprocessing, you should have numerical data that is ready for modeling.

processed_data = synthesizer.preprocess(real_data)

Parameters

  • (required) data: A dictionary mapping the table name (string) to a pandas DataFrame object with the real data for that table

Output A dictionary mapping each table name to a pandas DataFrame object with the preprocessed version of the real data. Based on the transformations, some columns may be converted, added or removed.

fit_processed_data

Use this function to perform model training on preprocessed data. Note that this step may take a while, as it requires machine learning to learn trends from your dataset.

synthesizer.fit_processed_data(processed_data)

Parameters

  • (required) processed_data: A dictionary mapping each table name to a pandas DataFrame with data that is ready for machine learning. (This is the output of the preprocess method.)

Output (None) This step trains the model using machine learning. After this step, you are ready to create synthetic data. For more details, see Sampling.

fit

Use this function to perform the preprocessing and model training all in one step. Internally, this uses both the preprocess and fit_processed_data functions in succession.

synthesizer.fit(real_data)

Parameters

  • (required) data: A dictionary mapping each table name to a pandas DataFrame object with the real data that you want the synthesizer to learn from

Output (None) After this step, you are ready to create synthetic data. For more details, see Sampling.
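
As a sketch, the two approaches below are equivalent ways to train the same synthesizer:

# option 1: step-wise
processed_data = synthesizer.preprocess(real_data)
synthesizer.fit_processed_data(processed_data)

# option 2: all at once (runs preprocess and fit_processed_data internally)
synthesizer.fit(real_data)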

FAQ

Why am I seeing column names that are different from my original data?

If you have added constraints, then the SDV may add or delete columns to accommodate the business logic.

Can I remove a transformer if I don't want any processing?

In some cases, your data may already be processed such that the synthesizer does not need to perform additional transformations. You can remove transformers by assigning them to None.

synthesizer.update_transformers(
  table_name='hotels',
  column_name_to_transformer={
    'rating': None
  })

Why are some of the transformers None?

Based on the data you're using, the synthesizer may determine that no transformations are necessary for machine learning. In this case, you will see that the column is assigned to None, meaning that the data will not undergo any transformation for the machine learning model.

* SDV Enterprise Feature. This feature is only available for licensed, enterprise users. For more information, visit our Explore SDV page.
