Conditional Sampling

Do you have exact values that you'd like to include in the synthetic data? Use conditional sampling to provide this information. Conditional sampling allows you to:

  • Generate hypothetical scenarios, by fixing the values to correspond to extreme cases

  • De-bias your data, by requesting an equal balance of labels

  • Impute unknown data, by requesting the data that you already know

Providing Fixed Conditions

Use conditions to provide exact, fixed values that you'd like. The SDV factors in your conditions and updates the rest of the data based on them.

Define Your Conditions

Create a Condition object to specify the exact values that you want to include in the synthetic data, for one or more columns.

from sdv.sampling import Condition

suite_guests_with_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': True}
)

suite_guests_without_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': False}
)

Parameters

  • (required) num_rows: The number of rows that need to be included in the scenario

  • (required) column_values: A dictionary describing the scenario. The keys should be column names and the values should be the exact data that those columns should have.

You may require multiple conditions. Define as many Condition objects as you need to construct your hypothetical scenario. For example, you may want to create a 50/50 mix of active and inactive users across various tiers, as sketched below.
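For illustration, such a scenario could look like the following sketch. The is_active and tier column names are hypothetical stand-ins for your own columns.

from sdv.sampling import Condition

# hypothetical columns: replace 'is_active' and 'tier' with your own column names
active_gold_users = Condition(
    num_rows=100,
    column_values={'is_active': True, 'tier': 'GOLD'}
)

inactive_gold_users = Condition(
    num_rows=100,
    column_values={'is_active': False, 'tier': 'GOLD'}
)

active_silver_users = Condition(
    num_rows=100,
    column_values={'is_active': True, 'tier': 'SILVER'}
)

inactive_silver_users = Condition(
    num_rows=100,
    column_values={'is_active': False, 'tier': 'SILVER'}
)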

sample_from_conditions

Use this function to create synthetic data based on the conditions.

synthetic_data = custom_synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards],
    output_file_path='synthetic_simulated_scenario.csv'
)

Parameters

  • (required) conditions: A list of Condition objects that specify the exact values that you want to fix and the number of rows to synthesize.

  • batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.

  • max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple batches to create valid data. Defaults to 100.

  • output_file_path: A string describing a CSV filepath for writing the synthetic data. Specify None to skip writing to a file. Defaults to None.

Returns

A pandas DataFrame object with synthetic data. The synthetic data is simulated based on the conditions.
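As a quick sanity check (a hypothetical follow-up, assuming the conditions defined above), you can verify that the returned data contains exactly the rows you requested.

# expect 500 rows total: all suites, 250 with rewards and 250 without
print(len(synthetic_data))
print(synthetic_data['room_type'].value_counts())
print(synthetic_data['has_rewards'].value_counts())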

The quality of your synthetic data is preserved. When you provide conditions, the SDV synthesizers will match your conditions and create the remaining data based on the patterns that it learned. Your synthetic data will continue to have the same statistical patterns between columns.

Condition on Known Columns

Do you already know all the information in particular columns? The SDV can factor in these columns and generate the remaining columns based on them.

sample_remaining_columns

Use this function to sample remaining columns based on known, reference columns.

import pandas as pd

reference_data = pd.DataFrame(data={
    'room_type': ['SUITE', 'SUITE', 'DELUXE', 'BASIC', 'BASIC'],
    'has_rewards': [True, True, True, False, False]
})

synthetic_data = synthesizer.sample_remaining_columns(
    known_columns=reference_data,
    max_tries_per_batch=500
)

Parameters

  • (required) known_columns: A pandas DataFrame object with columns that you already know the values for.

  • batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.

  • max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple batches to create valid data. Defaults to 100.

  • output_file_path: A string describing a CSV filepath for writing the synthetic data. Specify None to skip writing to a file. Defaults to None.

Returns

A pandas DataFrame object with synthetic data. The synthetic data is based on the known, reference columns.
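As an illustrative check (assuming the reference_data defined above), the known columns appear in the output exactly as you provided them, with the remaining columns filled in.

# one output row per row of reference_data;
# 'room_type' and 'has_rewards' match the known values exactly
print(synthetic_data.shape)
print(synthetic_data[['room_type', 'has_rewards']])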

The quality of your synthetic data is preserved. When you provide known columns, the SDV synthesizers will match the columns and create the remaining columns based on the patterns that it learned. Your synthetic data will continue to have the same statistical patterns between columns.

Troubleshooting

Conditional sampling is a complex feature. In some cases, your synthesizer may not be able to create all rows of synthetic data that you request. Let's walk through some areas that you can investigate.

Which synthesizer are you using?

The SDV synthesizers have different conditional sampling capabilities. If you are using the CTGAN, TVAE or CopulaGAN synthesizers, the SDV may be unable to complete your conditional sampling request in some instances.

Neural network-based synthesizers use a reject sampling approach: They sample synthetic data freely, keep the rows that match your conditions and repeat the process as needed. This may not be efficient if the conditional values are extremely rare.
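Conceptually, the approach resembles the sketch below. This is only an illustration of the idea, not the SDV's actual implementation; sample_rows stands in for any function that samples unconditionally.

import pandas as pd

def reject_sample(sample_rows, column_values, num_rows, max_tries=100):
    # illustrative reject sampling: sample freely, keep only matching rows, repeat
    matched = []
    total = 0
    for _ in range(max_tries):
        batch = sample_rows(num_rows)  # sample without any conditions
        for column, value in column_values.items():
            batch = batch[batch[column] == value]  # discard rows that don't match
        matched.append(batch)
        total += len(batch)
        if total >= num_rows:
            break

    return pd.concat(matched).head(num_rows)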

Some suggestions:

  • Use a larger batch_size or max_tries_per_batch. The more rare your conditions, the more attempts the SDV will have to make.

  • Try using the GaussianCopulaSynthesizer instead. This synthesizer can sample conditions mathematically instead of reject sampling, which is more efficient.
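For example, a switch to the GaussianCopulaSynthesizer might look like the sketch below, assuming metadata and real_data are already defined and reusing the conditions from earlier.

from sdv.single_table import GaussianCopulaSynthesizer

# assumes 'metadata' and 'real_data' are already defined
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

synthetic_data = synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards]
)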

Are you including constraints?

Any synthesizer that has constraints may have to use reject sampling to ensure the rows are valid. This can slow down the process.

Suggestions:

  • Use a larger batch_size or max_tries_per_batch. The more constraints you have, the more attempts the SDV will have to make.

  • Consider removing constraints and refitting your synthesizer. This will help if you are conditioning on desired columns that were involved in a constraint.

Are you requesting data that is out of range?

The SDV synthesizers are designed to learn patterns from the input data that you've added during fit, including the min and max ranges of each column. If you are requesting conditional values that are outside of these ranges, the synthesizer may not be able to accommodate your request.

Check to see whether your values are out of range. If so, you may have more success if you fit your synthesizer without enforcing min/max values. For more details, see the Modeling API.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(
    metadata,
    enforce_min_max_values=False
)
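After refitting, you can request values outside the original range. For instance, the sketch below assumes a numerical amenities_fee column and a value above the real data's maximum; both are hypothetical.

from sdv.sampling import Condition

# assumes 'real_data' is the same data you originally fit on
synthesizer.fit(real_data)

# hypothetical: request a fee above the maximum seen in the real data
high_fee_guests = Condition(
    num_rows=100,
    column_values={'amenities_fee': 200.00}
)

synthetic_data = synthesizer.sample_from_conditions(conditions=[high_fee_guests])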

Need more help?

Raise an issue on GitHub with more details about your usage. To help us replicate your issue, please provide us with as much detail as possible about your data, the synthesizer you're using and any parameters or features you're using with it.
