# Conditional Sampling

Do you have exact values that you'd like to include in the synthetic data? Use **conditional sampling** to provide this information. Conditional sampling allows you to target the exact data you need, while still preserving correlations with other, related variables.

* Generate hypothetical scenarios, by fixing the values to correspond to extreme cases
* De-bias your data, by requesting an equal balance of labels
* Impute unknown data, by requesting the data that you already know

```python
from sdv.sampling import Condition

# Step 1: Create Your Conditions
suite_guests_with_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': True}
)

# Step 2: Sample Synthetic Data
synthetic_data = synthesizer.sample_from_conditions([suite_guests_with_rewards])
```

Follow the 2-step workflow below to create your conditions and then ask SDV to sample with them.

## Step 1: Create Your Conditions

Use conditions to specify which exact, fixed values you'd like to appear in your synthetic data. SDV creates synthetic data with the fixed values and generates the other variables based on them. The other variables will preserve the same patterns and correlations with respect to the fixed values.

You can create as many conditions as you need to fully describe the synthetic data that you'd like to create.

### Condition

Create a `Condition` object to specify exact values you want to include in the synthetic data.

*In this example, we want to create 250 guests staying in suites with rewards, and an additional 250 guests staying in suites without rewards.*

```python
from sdv.sampling import Condition

suite_guests_with_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': True}
)

suite_guests_without_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': False}
)
```

**Parameters**

* (required) `num_rows`: The number of rows that need to be included in the scenario
* (required) `column_values`: A dictionary describing the scenario. The keys should be column names and the values should be the exact data that each column should have. *You can fix any columns, as long as they are not primary or foreign keys.*

{% hint style="info" %}
**You may require multiple conditions.** Define as many `Condition` objects as you need to construct your hypothetical scenario. For example, you may want to create a 50/50 mix of active and inactive users of various tiers.
{% endhint %}
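
A balanced mix like the one described above can be generated programmatically rather than by hand. The sketch below is one possible approach: it builds the keyword arguments for each `Condition` in plain Python, splitting the requested rows evenly across every combination of labels. The helper name `balanced_condition_specs` and the `is_active` and `tier` columns are hypothetical, chosen only for illustration.

```python
from itertools import product


def balanced_condition_specs(total_rows, **labels_by_column):
    """Split total_rows evenly across every combination of the given labels.

    Hypothetical helper (not part of SDV): returns one dict of Condition
    keyword arguments (num_rows, column_values) per combination.
    """
    columns = list(labels_by_column)
    combinations = list(product(*labels_by_column.values()))
    base, remainder = divmod(total_rows, len(combinations))
    specs = []
    for position, combination in enumerate(combinations):
        specs.append({
            # Spread any remainder over the first few combinations.
            'num_rows': base + (1 if position < remainder else 0),
            'column_values': dict(zip(columns, combination)),
        })
    return specs


# Assumed columns: 'is_active' (boolean) and 'tier' (categorical).
specs = balanced_condition_specs(
    600,
    is_active=[True, False],
    tier=['BASIC', 'SILVER', 'GOLD'],
)

for spec in specs:
    print(spec)
```

Each dictionary can then be unpacked into an SDV object with `Condition(**spec)`, and the resulting list can be passed to `sample_from_conditions`.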

### DataFrameCondition

Create a `DataFrameCondition` if you already have a DataFrame of fixed-value columns that you'd like to include in the synthetic data. This is a shortcut you can use instead of creating multiple `Condition` objects.

```python
import pandas as pd
from sdv.sampling import DataFrameCondition

my_fixed_datapoints = pd.DataFrame(data={
    'room_type': ['SUITE', 'SUITE', 'SUITE', 'SUITE', 'SUITE'],
    'has_rewards': [True, True, True, False, False]
})

type_rewards_condition = DataFrameCondition(dataframe=my_fixed_datapoints)
```

**Parameters**

* (required) `dataframe`: A pandas.DataFrame object containing the columns whose values you want to fix
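
To see why a `DataFrameCondition` works as a shortcut, it can help to count the distinct rows in the DataFrame: each distinct row plays roughly the role of one `Condition`, with `num_rows` equal to the number of times it appears. The sketch below does that counting with pandas and the standard library. The helper name `equivalent_condition_specs` is hypothetical and not part of SDV.

```python
from collections import Counter

import pandas as pd


def equivalent_condition_specs(dataframe):
    """Summarize a DataFrame of fixed values as (num_rows, column_values) specs."""
    columns = list(dataframe.columns)
    # Count how many times each distinct row of fixed values appears.
    counts = Counter(dataframe.itertuples(index=False, name=None))
    return [
        {'num_rows': count, 'column_values': dict(zip(columns, row))}
        for row, count in counts.items()
    ]


my_fixed_datapoints = pd.DataFrame(data={
    'room_type': ['SUITE', 'SUITE', 'SUITE', 'SUITE', 'SUITE'],
    'has_rewards': [True, True, True, False, False]
})

for spec in equivalent_condition_specs(my_fixed_datapoints):
    print(spec)
```

For the DataFrame above, this yields two specs: three rows of suite guests with rewards and two rows of suite guests without rewards.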

## Step 2: Sample Synthetic Data

Once you have created your conditions, you can use any single-table SDV synthesizer to sample from them. Your synthesizer must have already been trained (fit) on the data.

### sample\_from\_conditions

Use this function to create synthetic data based on the conditions.

```python
synthetic_data = synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards, type_rewards_condition],
    output_file_path='synthetic_simulated_scenario.csv'
)
```

**Parameters**

* (required) `conditions`: A list of Condition or DataFrameCondition objects that specify the exact values that you want to fix.
* `batch_size`: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as `num_rows`.
* `max_tries_per_batch`: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple batches to create valid data. Defaults to `100`.
* `output_file_path`: A string describing a CSV filepath for writing the synthetic data. Set to `None` to skip writing to a file. Defaults to `None`.

**Returns** A [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object with synthetic data. The synthetic data is simulated based on the conditions.

{% hint style="success" %}
**The quality of your synthetic data is preserved.** When you provide conditions, the SDV synthesizers will match your conditions and create the remaining data based on the patterns that they learned. Your synthetic data will continue to have the same statistical patterns between columns.
{% endhint %}

## Troubleshooting

{% hint style="warning" %}
**Conditional sampling is a complex feature.** In some cases, your synthesizer may not be able to create all rows of synthetic data that you request. Let's walk through some areas that you can investigate.
{% endhint %}
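
Before investigating the causes, it can be useful to measure how many rows you actually received for each condition. The sketch below counts the rows in a DataFrame that match a dictionary of fixed values. `count_matching_rows` is a hypothetical helper, and the small DataFrame is a hand-written stand-in for real synthesizer output.

```python
import pandas as pd


def count_matching_rows(data, column_values):
    """Count the rows of data where every listed column equals its fixed value."""
    matches = pd.Series(True, index=data.index)
    for column, value in column_values.items():
        matches &= data[column] == value
    return int(matches.sum())


# Hand-written stand-in for the output of sample_from_conditions.
synthetic_data = pd.DataFrame({
    'room_type': ['SUITE', 'SUITE', 'SUITE', 'BASIC'],
    'has_rewards': [True, True, False, True],
})

requested = {'room_type': 'SUITE', 'has_rewards': True}
print(count_matching_rows(synthetic_data, requested), 'matching rows')
```

Comparing this count with the `num_rows` you requested shows which conditions fell short.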

#### Which synthesizer are you using?

The SDV synthesizers have different conditional sampling capabilities. If you are using the CTGAN, TVAE or CopulaGAN synthesizers, the SDV may be unable to complete your conditional sampling request in some instances.

Neural network-based synthesizers use a *reject sampling* approach: They sample synthetic data freely, keep the rows that match your conditions and repeat the process as needed. This may not be efficient if the conditional values are extremely rare.

Some suggestions:

* Use a larger `batch_size` or `max_tries_per_batch`. The more rare your conditions, the more attempts the SDV will have to make.
* Try using the [GaussianCopulaSynthesizer](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/gaussiancopulasynthesizer) instead. This synthesizer can sample conditions mathematically instead of reject sampling, which is more efficient.
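
The reject sampling loop described above can be illustrated with a toy model. The sketch below is not SDV's implementation: `reject_sample`, `draw_row` and the 5% suite rate are assumptions made for illustration, and the parameters only loosely mirror `batch_size` and `max_tries_per_batch`.

```python
import random


def reject_sample(draw_row, matches, num_rows, batch_size, max_tries):
    """Draw batches freely and keep only the rows that match the condition.

    Returns the kept rows and the number of batches drawn. Fewer than
    num_rows rows come back if max_tries batches are not enough.
    """
    kept = []
    for attempt in range(1, max_tries + 1):
        batch = [draw_row() for _ in range(batch_size)]
        kept.extend(row for row in batch if matches(row))
        if len(kept) >= num_rows:
            return kept[:num_rows], attempt
    return kept, max_tries


rng = random.Random(0)


def draw_row():
    # Toy stand-in for a fitted model: about 5% of rows are suites.
    room_type = 'SUITE' if rng.random() < 0.05 else 'BASIC'
    return {'room_type': room_type}


def is_suite(row):
    return row['room_type'] == 'SUITE'


def is_basic(row):
    return row['room_type'] == 'BASIC'


common_rows, common_batches = reject_sample(draw_row, is_basic, 10, 100, 100)
rare_rows, rare_batches = reject_sample(draw_row, is_suite, 10, 100, 100)
print('batches needed for the common value:', common_batches)
print('batches needed for the rare value:', rare_batches)
```

In this toy setup, most freely drawn rows are discarded when the requested value is rare, so a larger batch size or more tries gives the loop more chances to reach the requested number of rows.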

#### Are you including constraints?

Any synthesizer that has [constraints](https://docs.sdv.dev/sdv/single-table-data/modeling/customizations/constraints) may have to use *reject sampling* to ensure the rows are valid. This can slow down the process.

Suggestions:

* Use a larger `batch_size` or `max_tries_per_batch`. The more constraints you have, the more attempts the SDV will have to make.
* Consider removing constraints and refitting your synthesizer. This will help if you are conditioning on desired columns that were involved in a constraint.

#### Are you requesting data that is out of range?

The SDV synthesizers are designed to learn patterns from the input data that you've added during `fit`, including the min and max ranges of each column. If you are requesting conditional values that are outside of these ranges, the synthesizer may not be able to accommodate your request.

Check whether your values are out of range. If so, you may have more success if you fit your synthesizer without enforcing min and max values. For more details, see the [Modeling API](https://docs.sdv.dev/sdv/single-table-data/modeling).

```python
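from sdv.single_table import GaussianCopulaSynthesizer

# `metadata` is the metadata object that describes your table
synthesizer = GaussianCopulaSynthesizer(
    metadata,
    enforce_min_max_values=False
)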
```
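
To check whether the values you are requesting are out of range, you can compare them against the minimum and maximum of the data used for fitting. The following is a minimal sketch with pandas that only looks at numeric columns. The helper name `find_out_of_range` and the `room_rate` column are hypothetical, used only for illustration.

```python
import pandas as pd


def find_out_of_range(real_data, column_values):
    """Return {column: (min, max)} for numeric conditions outside the real range."""
    out_of_range = {}
    for column, value in column_values.items():
        series = real_data[column]
        is_boolean = pd.api.types.is_bool_dtype(series)
        if is_boolean or not pd.api.types.is_numeric_dtype(series):
            continue  # This sketch only compares numeric columns.
        low, high = float(series.min()), float(series.max())
        if not low <= value <= high:
            out_of_range[column] = (low, high)
    return out_of_range


# Assumed example data with a numeric 'room_rate' column.
real_data = pd.DataFrame({
    'room_type': ['BASIC', 'SUITE', 'BASIC'],
    'room_rate': [100.0, 300.0, 150.0],
})

print(find_out_of_range(real_data, {'room_type': 'SUITE', 'room_rate': 500.0}))
```

An empty result means every numeric value you requested falls inside the range seen in the real data.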

#### Need more help?

If you encounter any problems, please [visit our forum](https://forum.datacebo.com/). You can browse existing issues to see if there's a solution. If you cannot find what you're looking for, create an account and start a new thread.

