OneHotEncoding

The OneHotEncoding constraint enforces that a set of columns follows a one hot encoding scheme. That is, in every row, exactly one of the columns must contain a value of 1 while all the others must be 0.
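As a rough illustration of this rule, the plain-Python sketch below checks whether a single row is one hot encoded. It is not part of SDV; the helper name `is_one_hot` and the sample rows are hypothetical and exist only for this example.

```python
def is_one_hot(row, column_names):
    """Return True if exactly one listed column is 1 and all others are 0."""
    values = [row[name] for name in column_names]
    # Every value must be 0 or 1, and the values must sum to exactly 1.
    return all(v in (0, 1) for v in values) and sum(values) == 1


# Hypothetical column names and rows, used only for illustration.
columns = ['status_active', 'status_inactive', 'status_on_hold']

valid_row = {'status_active': 0, 'status_inactive': 1, 'status_on_hold': 0}
two_hot_row = {'status_active': 1, 'status_inactive': 1, 'status_on_hold': 0}
all_zero_row = {'status_active': 0, 'status_inactive': 0, 'status_on_hold': 0}

print(is_one_hot(valid_row, columns))     # True
print(is_one_hot(two_hot_row, columns))   # False
print(is_one_hot(all_zero_row, columns))  # False
```

A row with two 1s or with no 1 at all both violate the scheme, which is why the check looks at the sum as well as the individual values.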

Constraint API

Create a OneHotEncoding constraint.

Parameters:

  • (required) column_names: A list of column names that, together, form the one hot encoding scheme. The columns must be listed as numerical in your metadata.

  • table_name: A string with the name of the table to apply this to. Required if you have a multi-table dataset.

  • learning_strategy: A string that controls how SDV should ultimately enforce the constraint internally. No matter which strategy you choose, you are guaranteed to have synthetic, one hot encoded columns. Options:

    • (default) 'one_hot': SDV will keep the one hot encoded columns for the underlying ML model to learn. When creating synthetic data, SDV will enforce that exactly 1 of the columns is picked.

    • 'categorical': SDV will collapse the one hot encoded columns into a single, categorical column for the underlying ML model to learn. When creating synthetic data, SDV will expand the category value back into the original set of one hot encoded columns.

from sdv.cag import OneHotEncoding

my_constraint = OneHotEncoding(
    column_names=['status_active', 'status_inactive', 'status_on_hold'],
    learning_strategy='one_hot'
)
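
To make the 'categorical' learning strategy more concrete, here is a conceptual sketch of what collapsing and expanding a one hot encoding could look like. This is an assumption-laden illustration in plain Python, not SDV's internal implementation; the `collapse` and `expand` helpers are hypothetical names.

```python
# Hypothetical column names, matching the example constraint above.
columns = ['status_active', 'status_inactive', 'status_on_hold']


def collapse(row, column_names):
    """Map a one hot encoded row to the name of its single 'hot' column."""
    hot = [name for name in column_names if row[name] == 1]
    if len(hot) != 1:
        raise ValueError('Row is not one hot encoded')
    return hot[0]


def expand(category, column_names):
    """Map a category value back to a full set of one hot encoded columns."""
    return {name: int(name == category) for name in column_names}


row = {'status_active': 0, 'status_inactive': 0, 'status_on_hold': 1}

category = collapse(row, columns)  # a single categorical value
restored = expand(category, columns)  # back to three 0/1 columns

print(category)         # status_on_hold
print(restored == row)  # True
```

Because `expand` always sets exactly one column to 1, any category value produces a valid one hot encoded row, which mirrors the guarantee described for both strategies.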

Usage

Apply the constraint to any SDV synthesizer. Then fit and sample as usual.

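from sdv.single_table import GaussianCopulaSynthesizer

# metadata describes your dataset and is assumed to be defined already
synthesizer = GaussianCopulaSynthesizer(metadata)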
synthesizer.add_constraints([my_constraint])

synthesizer.fit(data)
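# single-table synthesizers need the number of rows to generate; 100 is an arbitrary example value
synthetic_data = synthesizer.sample(num_rows=100)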

For more information about using predefined constraints, please see the Constraint-Augmented Generation tutorial.

FAQs

What is the difference between the learning strategies? Which one do I pick?

The learning_strategy parameter influences how SDV will internally enforce the constraint.

  • 'one_hot': This strategy will keep all the columns in place for the underlying ML model to learn from. Keeping all the columns may impact performance, making fitting and sampling slower. However, for certain synthesizers like TVAE, this will produce higher quality data, meaning that the proportion of categories will more closely match the original.

  • 'categorical': This strategy will collapse the one hot encoded columns into a single, categorical column for the underlying ML model to learn from. This can improve performance, as there is only 1 column to learn instead of many. For certain synthesizers like GaussianCopula, this may also produce higher quality data, meaning that the proportion of categories will more closely match the original. When creating synthetic data, SDV will expand the value back into the original set of one hot encoded columns.

Both learning strategies will ultimately create 100% valid synthetic data with a set of one hot encoded columns.
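
If you want to judge which strategy suits your data, one option is to compare the proportion of categories in the real and synthetic data. The sketch below is one hypothetical way to do that in plain Python, assuming each dataset is available as a list of row dictionaries. The helper names and the tiny datasets are made up for illustration, and total variation distance is simply one reasonable metric, not something SDV prescribes here.

```python
from collections import Counter


def category_proportions(rows, column_names):
    """Fraction of rows in which each one hot column is the 'hot' one."""
    counts = Counter()
    for row in rows:
        for name in column_names:
            if row[name] == 1:
                counts[name] += 1
    return {name: counts[name] / len(rows) for name in column_names}


def total_variation_distance(p, q):
    """Half the sum of absolute differences; 0 means identical proportions."""
    return 0.5 * sum(abs(p[name] - q[name]) for name in p)


columns = ['status_active', 'status_inactive', 'status_on_hold']


def one_hot(category):
    """Build a one hot encoded row for the given column name."""
    return {name: int(name == category) for name in columns}


# Made-up stand-ins for real and synthetic data.
real_rows = [one_hot(c) for c in
             ['status_active', 'status_active', 'status_inactive', 'status_on_hold']]
synthetic_rows = [one_hot(c) for c in
                  ['status_active', 'status_inactive', 'status_inactive', 'status_on_hold']]

real_props = category_proportions(real_rows, columns)
synthetic_props = category_proportions(synthetic_rows, columns)
distance = total_variation_distance(real_props, synthetic_props)

print(real_props)
print(synthetic_props)
print(distance)  # 0.25
```

A smaller distance means the synthetic category proportions track the original more closely, so you could fit once with each learning strategy and keep the one with the lower value.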
