Configuration

For your HyperTransformer to work, you need to provide it with a configuration (config) that describes:

the columns in your dataset and
the transformers that should be applied to turn them into numerical data.

Creating the config

To create the config you can either allow the HyperTransformer to automatically detect it from your data or you can write it by hand.

detect_initial_config()

This method automatically detects the config from your data and sets it. It overrides any existing config you may have previously set or detected.

Parameters

(required) data: a pandas DataFrame containing your data.

Output (None) This function prints out the status and detected config. The config describes the sdtypes of each column and the transformer objects that will be used for each. For more details, see the Basic Concepts guide.

Examples

ht.detect_initial_config(data=customers)

Detecting a new config from the data ... SUCCESS
Setting the new config ... SUCCESS

Config:
{
  'sdtypes': {
    'last_login': 'datetime',
    'email_optin': 'boolean',
    'credit_card': 'categorical',
    'age': 'numerical',
    'dollars_spent': 'numerical'
  },
  'transformers': {
    'last_login': UnixTimestampEncoder(missing_value_replacement="mean"),
    'email_optin': UniformEncoder(),
    'credit_card': UniformEncoder(),
    'age': FloatFormatter(),
    'dollars_spent': FloatFormatter(missing_value_replacement="mean")
  }
}

set_config()

This method sets the config. Use this as an alternative to detect_initial_config if you want to write and set the config manually.

Parameters

(required) config: A nested dictionary that describes the config. It must follow the format shown below.

{
  'sdtypes': {
    'column_name': <sdtype>,
    'column_name': <sdtype>,
    ...
  },
  'transformers': {
    'column_name': <transformer object>,
    'column_name': <transformer object>,
    ...
  } 
}

The public RDT supports the following sdtypes: 'boolean', 'categorical', 'datetime', 'numerical', 'pii' and 'id'.

You can use any transformer object from the RDT (or specify None if you do not want to transform the column). Visit the Transformers Glossary to browse through the available transformers and their settings.

See the Config guide for more details.
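A quick structural check before calling set_config() can catch formatting mistakes early. The helper below is a hypothetical sketch and not part of RDT. It assumes that both sections should name exactly the same columns and that only the public sdtypes listed on this page are in use.

```python
# Hypothetical helper (not part of RDT): checks the shape of a config dict.
# Assumption: only the public sdtypes named on this page are allowed.
PUBLIC_SDTYPES = {'boolean', 'categorical', 'datetime', 'numerical', 'pii', 'id'}


def check_config(config):
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []
    for key in ('sdtypes', 'transformers'):
        if key not in config:
            problems.append(f"missing top-level key: {key!r}")
    if problems:
        return problems

    sdtype_columns = set(config['sdtypes'])
    transformer_columns = set(config['transformers'])

    # Every column should appear in both sections.
    for column in sorted(sdtype_columns - transformer_columns):
        problems.append(f"column {column!r} has an sdtype but no transformer entry")
    for column in sorted(transformer_columns - sdtype_columns):
        problems.append(f"column {column!r} has a transformer entry but no sdtype")

    # Every sdtype should be one of the supported names.
    for column, sdtype in config['sdtypes'].items():
        if sdtype not in PUBLIC_SDTYPES:
            problems.append(f"column {column!r} has unsupported sdtype {sdtype!r}")
    return problems


good = {'sdtypes': {'age': 'numerical'}, 'transformers': {'age': None}}
bad = {'sdtypes': {'age': 'numerical', 'notes': 'text'}, 'transformers': {'age': None}}
print(check_config(good))
print(check_config(bad))
```

An empty list means the dictionary has the expected shape. It does not guarantee that each transformer is compatible with its column's sdtype.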

Output (None)

Examples

You must provide the full config that describes all the columns in your dataset.

from rdt.transformers.datetime import UnixTimestampEncoder
from rdt.transformers.categorical import UniformEncoder
from rdt.transformers.numerical import FloatFormatter

ht.set_config(config={
  'sdtypes': {
    'last_login': 'datetime',
    'email_optin': 'boolean',
    'credit_card': 'categorical',
    'age': 'numerical',
    'dollars_spent': 'numerical'
  },
  'transformers': {
    'last_login': UnixTimestampEncoder(missing_value_replacement="mean"),
    'email_optin': UniformEncoder(),
    'credit_card': UniformEncoder(),
    'age': None,
    'dollars_spent': FloatFormatter(missing_value_replacement="mean")
  }
})
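When the config is written by hand, a column can easily end up in one section and not the other. As a minimal sketch, the hypothetical helper below (not part of RDT) builds both sections from a single per-column table. The transformer values are plain strings standing in for real transformer objects so that the snippet runs without RDT installed.

```python
# Hypothetical helper (not part of RDT): builds the two-part config
# from one table so the column names always match in both sections.
def build_config(columns):
    """Build a config dict from {column_name: (sdtype, transformer)}."""
    config = {'sdtypes': {}, 'transformers': {}}
    for column_name, (sdtype, transformer) in columns.items():
        config['sdtypes'][column_name] = sdtype
        config['transformers'][column_name] = transformer
    return config


# Placeholder strings stand in for transformer objects (illustration only).
config = build_config({
    'last_login': ('datetime', 'UnixTimestampEncoder'),
    'email_optin': ('boolean', 'UniformEncoder'),
    'age': ('numerical', None),  # None: leave this column untransformed
})
print(config['sdtypes'])
print(config['transformers'])
```

Once the placeholder strings are replaced with actual transformer objects, the resulting dictionary has the format that set_config() expects.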

Viewing the config

get_config()

At any point, you can use this method to retrieve the current config.

Parameters (None)

Output A nested dictionary that describes the config. It follows the format shown below.

{
  'sdtypes': {
    'column_name': <sdtype>,
    'column_name': <sdtype>,
    ...
  },
  'transformers': {
    'column_name': <transformer object>,
    'column_name': <transformer object>,
    ...
  }
}

See the Config guide for more details.

Examples

config = ht.get_config()

{
  'sdtypes': {
    'last_login': 'datetime',
    'email_optin': 'boolean',
    'credit_card': 'categorical',
    'age': 'numerical',
    'dollars_spent': 'numerical'
  },
  'transformers': {
    'last_login': UnixTimestampEncoder(missing_value_replacement="mean"),
    'email_optin': UniformEncoder(),
    'credit_card': UniformEncoder(),
    'age': None,
    'dollars_spent': FloatFormatter(missing_value_replacement="mean")
  }
}
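Since the returned config is a nested dictionary, it can be inspected with ordinary dictionary operations. The sketch below uses a hypothetical helper (not part of RDT) to list the columns whose transformer is None, meaning they will be left untransformed. The example dictionary uses a placeholder string instead of a transformer object so that it runs on its own.

```python
# Hypothetical helper (not part of RDT): inspects a config dictionary.
def untransformed_columns(config):
    """Return the names of columns whose transformer is None."""
    return [
        column_name
        for column_name, transformer in config['transformers'].items()
        if transformer is None
    ]


# A hand-written stand-in for the dictionary returned by get_config().
example_config = {
    'sdtypes': {'credit_card': 'categorical', 'age': 'numerical'},
    'transformers': {'credit_card': 'UniformEncoder', 'age': None},
}
print(untransformed_columns(example_config))
```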

Modifying the config

Customize your HyperTransformer by modifying the config.

update_sdtypes()

This method modifies the sdtypes. It also automatically assigns a new transformer that's compatible with the new sdtype.

Parameters

(required) column_name_to_sdtype: A dictionary that maps a column name to its new sdtype. The public RDT supports 'boolean', 'categorical', 'datetime', 'numerical', 'pii' and 'id' sdtypes. More are available for licensed users.

Output (None) After using this method, you can use get_config() to verify the changes.

Examples

ht.update_sdtypes(column_name_to_sdtype={
  'last_login': 'datetime',
  'email_optin': 'categorical'
})
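One way to verify the changes is to compare the sdtypes before and after the update. The helpers below are a hypothetical sketch and not part of RDT. They copy the sdtypes mapping as a precaution, in case the dictionary returned by get_config() reflects later updates. The dictionaries are hand-written stand-ins so that the snippet runs without RDT.

```python
# Hypothetical helpers (not part of RDT) for comparing sdtypes over time.
def snapshot_sdtypes(config):
    """Return an independent copy of the sdtypes section of a config."""
    return dict(config['sdtypes'])


def diff_sdtypes(before, after):
    """Map each changed column to an (old_sdtype, new_sdtype) pair."""
    changes = {}
    for column_name, new_sdtype in after.items():
        old_sdtype = before.get(column_name)
        if old_sdtype != new_sdtype:
            changes[column_name] = (old_sdtype, new_sdtype)
    return changes


# Stand-ins for get_config() results taken before and after update_sdtypes().
before = snapshot_sdtypes({
    'sdtypes': {'last_login': 'datetime', 'email_optin': 'boolean'},
    'transformers': {},
})
after = {'last_login': 'datetime', 'email_optin': 'categorical'}
print(diff_sdtypes(before, after))
```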

update_transformers()

This method updates the transformers that will be used on specific columns. Use it to customize your HyperTransformer, for example by changing a transformer setting or swapping out one transformer for another.

Parameters

(required) column_name_to_transformer: A dictionary that maps a column name to the new transformer that will be used on it.

You can use any transformer object from the RDT. Visit the Transformers Glossary to browse through the available transformers and their settings.

Output (None) After using this method, you can use get_config() to verify the changes.

Examples

To update transformers, you must first create the transformers you want to use and then apply the method.

from rdt.transformers.datetime import OptimizedTimestampEncoder
from rdt.transformers.categorical import LabelEncoder

# create new transformer objects
login_transformer = OptimizedTimestampEncoder(missing_value_replacement='random')
credit_transformer = LabelEncoder(add_noise=True)

# update the columns to use the new transformers
ht.update_transformers(column_name_to_transformer={
  'last_login': login_transformer,
  'credit_card': credit_transformer
})

remove_transformers()

This method removes transformers for specific columns. Use this if you do not want the HyperTransformer to modify certain columns at all. It will skip over the given column names and modify the remaining columns that do have transformers.

Parameters

(required) column_names: A list of column names. The transformers for these column names are removed.

Output (None) After using this method, you can use get_config() to verify the changes.

Examples

# do not transform the credit_card or age columns
ht.remove_transformers(column_names=['credit_card', 'age'])

update_transformers_by_sdtype()

This method updates all columns of a given sdtype to use a specific transformer.

Parameters

(required) sdtype: An sdtype. This method will select all columns that match the sdtype.
(required) transformer_name: A string with the name of the transformer to use.
(optional) transformer_parameters: A dictionary that maps the name of the transformer parameter (string) to the parameter value. Use this if you want to override the default settings.

Visit the Transformers Glossary to browse through the available transformers and their settings.

Output (None) After using this method, you can use get_config() to verify the changes.

Examples

# update all numerical columns to use a specific transformer
ht.update_transformers_by_sdtype(
  sdtype='numerical',
  transformer_name='FloatFormatter',
  transformer_parameters={'missing_value_generation': 'from_column',
                          'enforce_min_max_values': True}
)
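Before changing transformers for a whole sdtype, it can help to preview which columns will be selected. The sketch below uses a hypothetical helper (not part of RDT) that reads the sdtypes section of a config dictionary. The example dictionary is a hand-written stand-in that mirrors the sdtypes used on this page.

```python
# Hypothetical helper (not part of RDT): previews the columns of one sdtype.
def columns_with_sdtype(config, sdtype):
    """Return the names of all columns in the config with the given sdtype."""
    return [
        column_name
        for column_name, column_sdtype in config['sdtypes'].items()
        if column_sdtype == sdtype
    ]


# Stand-in for the dictionary returned by get_config(); transformers omitted.
example_config = {
    'sdtypes': {
        'last_login': 'datetime',
        'email_optin': 'boolean',
        'credit_card': 'categorical',
        'age': 'numerical',
        'dollars_spent': 'numerical',
    },
    'transformers': {},
}
print(columns_with_sdtype(example_config, 'numerical'))
```

The same preview applies to any method that selects columns by sdtype.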

remove_transformers_by_sdtype()

This method removes transformers for all columns of a given sdtype. Use this method if you do not want to transform any columns of a particular sdtype.

Parameters

(required) sdtype: An sdtype. This method will remove the transformer for all columns that match the given sdtype.

Output (None) After using this method, you can use get_config() to verify the changes.

Examples

# do not transform any categorical columns in the dataset
ht.remove_transformers_by_sdtype(sdtype='categorical')

