Conditional Sampling

SDV Enterprise Bundle. This feature is available as part of the Targeted Sampling bundle, an optional add-on to SDV Enterprise. For more information, please visit the Targeted Sampling page.

Do you have exact values that you'd like to include in the synthetic data? Use conditional sampling to provide this information. Conditional sampling allows you to target the exact data you need, while still preserving correlations with other, related variables.

  • Generate hypothetical scenarios, by fixing the values to correspond to extreme cases

  • De-bias your data, by requesting an equal balance of labels

  • Impute unknown data, by requesting the data that you already know

In a multi-table setting, you can fix values in any number of different tables of your dataset. This feature is currently available for HSA and Independent Synthesizers.

from sdv.sampling import Condition, MultiTableCondition

# Step 1: Create Single-Table Conditions
resort_hotels = Condition(
    num_rows=10,
    table_name='hotels',
    column_values={'classification': 'RESORT'})

suite_guests_with_rewards = Condition(
    table_name='guests',
    column_values={'room_type': 'SUITE', 'has_rewards': True})

# Step 2: Compose Multi-Table Conditions    
suites_in_resorts = MultiTableCondition(
    conditions=[resort_hotels, suite_guests_with_rewards])

# Step 3: Sample Synthetic Data
synthetic_data = synthesizer.sample_from_conditions([suite_guests_with_rewards])

Follow the 3-step workflow below to create your conditions and then ask SDV to sample with them.

Step 1: Create Single-Table Conditions

Use conditions to specify which exact, fixed values you'd like to appear in each individual table of your synthetic data. SDV creates synthetic tables with the fixed values and creates the other variables (and tables) based on them. The other values will preserve the same patterns and correlations with respect to the fixed values.

You can create as many conditions as you need to fully describe the synthetic data that you'd like to create. Create them for 1 or more different tables in your dataset.

Condition

Create a Condition object to specify exact values you want to include in the synthetic data for a specific table.

In this example, we want to create 10 resort hotels, and some guests that are staying in suites.

from sdv.sampling import Condition

resort_hotels = Condition(
    num_rows=10,
    table_name='hotels',
    column_values={'classification': 'RESORT'})

suite_guests_with_rewards = Condition(
    table_name='guests',
    column_values={'room_type': 'SUITE', 'has_rewards': True})

Parameters

  • (required) table_name: A string with the name of the table for which you'd like to fix the values

  • column_values: A dictionary with the values to fix. The keys should be column names and the values should be the exact data that each column should have. You can fix any columns, as long as they are not primary or foreign keys. In a multi-table setting, column values may not be needed. It may be the case that you only want to specify a number of rows (num_rows) rather than the column values.

  • num_rows: The number of rows you would like to create for the table. In a multi-table setting, this may not be needed. SDV is able to determine the number of rows for a table based on the other tables that it is connected to. For example, if your original data had roughly 20 guests for every hotel, then SDV can automatically determine there should be 200 guests for 10 hotels.

You may require multiple conditions. Define as many Condition objects as you need to construct your hypothetical scenario. For example, you may want to create a 50/50 mix of active and inactive users of various tiers.
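A balanced set of conditions like this can be generated programmatically rather than written by hand. The sketch below uses a hypothetical helper, balanced_condition_specs, which is not part of SDV. It splits a row budget evenly across every combination of fixed values and returns plain dictionaries keyed by the Condition parameter names described above. The table name, column names and labels are assumptions for illustration only.

```python
from itertools import product


def balanced_condition_specs(table_name, fixed_columns, total_rows):
    """Split total_rows evenly across all combinations of the fixed values.

    fixed_columns maps a column name to the list of values to balance across.
    Returns one dict per combination, using the Condition parameter names.
    """
    names = list(fixed_columns)
    combos = list(product(*(fixed_columns[name] for name in names)))
    base, remainder = divmod(total_rows, len(combos))

    specs = []
    for index, combo in enumerate(combos):
        specs.append({
            'table_name': table_name,
            # Spread any leftover rows over the first few combinations
            'num_rows': base + (1 if index < remainder else 0),
            'column_values': dict(zip(names, combo)),
        })
    return specs


# Hypothetical 'users' table with 'status' and 'tier' columns
specs = balanced_condition_specs(
    table_name='users',
    fixed_columns={'status': ['ACTIVE', 'INACTIVE'], 'tier': ['BASIC', 'PREMIUM']},
    total_rows=100)

# With SDV available, each spec could be unpacked into a condition:
# conditions = [Condition(**spec) for spec in specs]
```

Here each of the four status and tier combinations receives 25 rows, so active and inactive users are split 50/50 within every tier.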

DataFrameCondition

Create a DataFrameCondition if you already have a DataFrame of fixed-value columns that you'd like to include in the synthetic data. This is a shortcut you can use instead of creating multiple Condition objects.

import pandas as pd
from sdv.sampling import DataFrameCondition

# Each row is one set of fixed values to include in the synthetic data
my_fixed_datapoints = pd.DataFrame(
    data={
        'room_type': ['SUITE', 'SUITE', 'SUITE', 'SUITE', 'SUITE'],
        'has_rewards': [True, True, True, False, False]
    })

type_rewards_condition = DataFrameCondition(
    table_name='guests',
    dataframe=my_fixed_datapoints)

Parameters:

  • (required) table_name: A string with the name of the table for which you'd like to fix values

  • (required) dataframe: A pandas.DataFrame object containing the columns whose values you want to fix
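To see why a DataFrameCondition works as a shortcut, it can help to group the fixed rows by hand. The sketch below assumes that a DataFrame with repeated rows asks for the same data as several Condition objects, one per distinct row, each with its own num_rows. The helper rows_to_condition_specs is hypothetical and not part of SDV. It works on plain dictionaries so that it runs without SDV installed.

```python
from collections import Counter


def rows_to_condition_specs(table_name, rows):
    """Group identical fixed-value rows into one spec per distinct row."""
    # Sort the items so that rows with the same values produce the same key
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [
        {'table_name': table_name, 'num_rows': count, 'column_values': dict(key)}
        for key, count in counts.items()
    ]


# The same five rows as the DataFrame example above.
# With pandas, they could come from my_fixed_datapoints.to_dict('records').
rows = [
    {'room_type': 'SUITE', 'has_rewards': True},
    {'room_type': 'SUITE', 'has_rewards': True},
    {'room_type': 'SUITE', 'has_rewards': True},
    {'room_type': 'SUITE', 'has_rewards': False},
    {'room_type': 'SUITE', 'has_rewards': False},
]

specs = rows_to_condition_specs('guests', rows)
```

The five rows collapse into two specs: 3 suite guests with rewards and 2 suite guests without.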

Step 2: Compose Multi-Table Conditions

Creating multi-table conditions tells SDV to ensure that the individual tables are connected to each other. Without this step, SDV can still create individual tables with the conditions, but they may not be connected to each other.

For example, in our case, we may want to ensure that the suite rewards guests are all staying in resort hotels. We can use a multi-table condition to ensure that this connection exists.

from sdv.sampling import MultiTableCondition
 
suites_in_resorts = MultiTableCondition(
    conditions=[resort_hotels, suite_guests_with_rewards])

Parameters:

  • (required) conditions: A list of the single-table Condition or DataFrameCondition objects. SDV ensures that these tables are connected to each other.

What are the sizes of the tables? For at least one of the tables, the conditions must define the number of rows. SDV can then algorithmically determine the number of rows to generate for each of the other tables.
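As a rough illustration of how one table's size can follow from another's, the sketch below reproduces the arithmetic from the hotel example: it scales the requested number of parent rows by the average number of child rows per parent in the original data. This is only an assumption about the general idea. SDV's actual algorithm is not described here and may differ, and estimate_child_rows is a hypothetical helper.

```python
def estimate_child_rows(original_parent_rows, original_child_rows, requested_parent_rows):
    """Scale the requested parent rows by the original children-per-parent ratio."""
    if original_parent_rows <= 0:
        raise ValueError('The original parent table must have at least one row.')
    children_per_parent = original_child_rows / original_parent_rows
    return round(requested_parent_rows * children_per_parent)


# Original data: 50 hotels and 1000 guests, roughly 20 guests per hotel.
# Requesting 10 hotels therefore suggests about 200 guests.
estimated_guests = estimate_child_rows(
    original_parent_rows=50,
    original_child_rows=1000,
    requested_parent_rows=10)
```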

Step 3: Sample Synthetic Data

Once you have created your conditions, you can use your synthesizer to sample with them. Your synthesizer must have already been trained (fit) on the data.

sample_from_conditions

Use this function to create synthetic data based on the conditions.

synthetic_data = synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards, type_rewards_condition],
    output_file_path='synthetic_simulated_scenario.csv'
)

Parameters

  • (required) conditions: A list of Condition or DataFrameCondition objects that specify the exact values that you want to fix.

  • batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.

  • max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple batches to create valid data. Defaults to 100.

  • output_file_path: A string describing a CSV filepath for writing the synthetic data. Set to None to skip writing to a file. Defaults to None.

Returns: A pandas DataFrame object with synthetic data. The synthetic data is simulated based on the conditions.

Troubleshooting

Which synthesizer are you using?

The SDV synthesizers have different conditional sampling capabilities. Under the hood, HSA and Independent Synthesizers are set to use a single-table synthesizer for each individual table. If you are using the CTGAN, TVAE or CopulaGAN synthesizers, the SDV may be unable to complete your conditional sampling request in some instances.

Neural network-based synthesizers use a reject sampling approach: They sample synthetic data freely, keep the rows that match your conditions and repeat the process as needed. This may not be efficient if the conditional values are extremely rare.

Some suggestions:

  • Use a larger batch_size or max_tries_per_batch. The rarer your conditions, the more attempts the SDV will have to make.

  • Try using the GaussianCopulaSynthesizer instead. This synthesizer can sample conditions mathematically instead of reject sampling, which is more efficient.
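The reject sampling loop described above can be sketched in a few lines. This is a simplified illustration, not SDV's implementation: the toy row generator, the reject_sample helper and its max_tries parameter are all assumptions, and the exact way SDV combines batch_size with max_tries_per_batch may differ.

```python
import random


def reject_sample(draw_row, condition, num_rows, batch_size=100, max_tries=100):
    """Sample rows freely and keep only the ones that match the condition."""
    accepted = []
    for _ in range(max_tries):
        batch = [draw_row() for _ in range(batch_size)]
        accepted.extend(
            row for row in batch
            if all(row[column] == value for column, value in condition.items()))
        if len(accepted) >= num_rows:
            break
    # May return fewer rows than requested if the condition is too rare
    return accepted[:num_rows]


rng = random.Random(0)


def draw_guest():
    """Toy stand-in for a trained synthesizer sampling one unconditioned row."""
    return {
        'room_type': rng.choice(['BASIC', 'DELUXE', 'SUITE']),
        'has_rewards': rng.random() < 0.5,
    }


common = reject_sample(draw_guest, {'room_type': 'SUITE', 'has_rewards': True}, num_rows=10)
never_seen = reject_sample(draw_guest, {'room_type': 'PENTHOUSE'}, num_rows=10)
```

The first request fills quickly because about one in six toy rows matches. The second exhausts every attempt and returns no rows, because the generator never produces that value. This is why rare conditions need more attempts, and why values the synthesizer cannot produce will fail regardless of the settings.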

Are you including constraints?

Any synthesizer that has constraints may have to use reject sampling to ensure the rows are valid. This can slow down the process.

Suggestions:

  • Use a larger batch_size or max_tries_per_batch. The more constraints you have, the more attempts the SDV will have to make.

  • Consider removing constraints and refitting your synthesizer. This will help if you are conditioning on desired columns that were involved in a constraint.

Are you requesting data that is out of range?

The SDV synthesizers are designed to learn patterns from the input data that you've added during fit, including the min and max ranges of each column. If you are requesting conditional values that are outside of these ranges, the synthesizer may not be able to accommodate your request.

Check to see whether your values are out of range. If so, you may have more success if you fit your synthesizer without enforcing min/max values. For more details, see the Modeling API.

synthesizer = HSASynthesizer(metadata)

synthesizer.set_table_parameters(
    table_name='guests',
    table_synthesizer='GaussianCopulaSynthesizer',
    table_parameters={
        'enforce_min_max_values': False
    }
)
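Before refitting, you may want to confirm which requested values actually fall outside the training ranges. The sketch below is a minimal check for numeric columns, assuming you have the training values available as a mapping from column name to values (a pandas DataFrame column also works with min and max). The helper find_out_of_range and the room_rate and nights columns are hypothetical.

```python
def find_out_of_range(training_columns, column_values):
    """Return {column: (requested, min, max)} for values outside the training range."""
    problems = {}
    for column, value in column_values.items():
        observed = training_columns[column]
        low, high = min(observed), max(observed)
        if not low <= value <= high:
            problems[column] = (value, low, high)
    return problems


# Hypothetical training values for two numeric columns of the guests table
training = {
    'room_rate': [89.0, 120.5, 310.0],
    'nights': [1, 2, 7],
}

problems = find_out_of_range(training, {'room_rate': 950.0, 'nights': 3})
```

In this example only room_rate is flagged, since 950.0 lies above the largest training value of 310.0, while 3 nights is within the observed range of 1 to 7.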

Need more help?

Raise an issue on GitHub with more details about your usage. To help us replicate your issue, please provide us with as much detail as possible about your data, the synthesizer you're using and any parameters or features you're using with it.
