# ❖ SegmentSynthesizer

{% hint style="info" %}
❖ **SDV Enterprise Bundle**. This feature is available as part of the **XSynthesizers Bundle**, an optional add-on to SDV Enterprise. For more information, please visit the [XSynthesizers Bundle](https://docs.sdv.dev/sdv/explore/sdv-bundles/xsynthesizers) page.
{% endhint %}

The SegmentSynthesizer splits your real data into segments and fits a separate model to each one. You can supply any single-table synthesizer to model the segments. Use it when your real data is highly segmented, with each segment following different patterns.

```python
from sdv.single_table import SegmentSynthesizer

synthesizer = SegmentSynthesizer(metadata)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)
```

## Creating a synthesizer

When creating your synthesizer, you are required to pass in a [Metadata](https://docs.sdv.dev/sdv/single-table-data/data-preparation/creating-metadata) object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.

```python
synthesizer = SegmentSynthesizer(
    metadata, # required
    segmentation_params={
        'method': 'exact_values',
        'column_name': 'made_purchase'
    },
    per_segment_synthesizer='GaussianCopulaSynthesizer'
)
```

### Parameter Reference

**`enforce_min_max_values`**: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data

<table data-header-hidden><thead><tr><th width="179"></th><th></th></tr></thead><tbody><tr><td>(default) <code>True</code></td><td>The synthetic data will contain numerical values that are within the ranges of the real data.</td></tr><tr><td><code>False</code></td><td>The synthetic data may contain numerical values that are less than or greater than the real data.</td></tr></tbody></table>

**`enforce_rounding`**: Control whether the synthetic data should have the same number of decimal digits as the real data

<table data-header-hidden><thead><tr><th width="179"></th><th></th></tr></thead><tbody><tr><td>(default) <code>True</code></td><td>The synthetic data will be rounded to the same number of decimal digits that were observed in the real data</td></tr><tr><td><code>False</code></td><td>The synthetic data may contain more decimal digits than were observed in the real data</td></tr></tbody></table>
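To picture what `enforce_min_max_values` and `enforce_rounding` control, the sketch below applies both as post-processing steps on a single numerical column. It is a conceptual illustration only, not SDV's implementation; the helper names `learn_bounds_and_digits` and `postprocess` are hypothetical.

```python
from decimal import Decimal


def learn_bounds_and_digits(real_values):
    """Record the min, max, and largest number of decimal digits observed."""
    digits = max(
        max(0, -Decimal(str(value)).as_tuple().exponent)
        for value in real_values
    )
    return min(real_values), max(real_values), digits


def postprocess(raw_values, low, high, digits,
                enforce_min_max_values=True, enforce_rounding=True):
    """Optionally clip to the observed range and round to the observed digits."""
    result = []
    for value in raw_values:
        if enforce_min_max_values:
            value = min(max(value, low), high)  # clip into [low, high]
        if enforce_rounding:
            value = round(value, digits)
        result.append(value)
    return result


real_values = [10.5, 12.25, 19.75]        # stand-in for a real column
low, high, digits = learn_bounds_and_digits(real_values)

raw_values = [9.123456, 15.98765, 21.5]   # stand-in for unprocessed model output
cleaned = postprocess(raw_values, low, high, digits)
print(cleaned)
```

With both options set to `False`, the values pass through unchanged, so they may fall outside the real range and carry extra decimal digits.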

**`locales`**: A list of locale strings. Any PII columns will correspond to the locales that you provide.

<table data-header-hidden><thead><tr><th width="218"></th><th></th></tr></thead><tbody><tr><td>(default) <code>['en_US']</code></td><td>Generate PII values in English corresponding to US-based concepts (e.g., addresses, phone numbers, etc.)</td></tr><tr><td><code>&#x3C;list></code></td><td><p>Create data from the list of locales. Each locale string consists of a 2-character code for the language and a 2-character code for the country, separated by an underscore.</p><p></p><p>For example <code>[</code><a href="https://faker.readthedocs.io/en/master/locales/en_US.html"><code>"en_US"</code></a><code>,</code> <a href="https://faker.readthedocs.io/en/master/locales/fr_CA.html"><code>"fr_CA"</code></a><code>]</code>.</p><p>For all options, see the <a href="https://faker.readthedocs.io/en/master/locales.html">Faker docs</a>.</p></td></tr></tbody></table>

**`segmentation_params`**: A dictionary of parameters that govern how to perform the segmentation. This allows for one of two possible methods.

<table data-header-hidden><thead><tr><th width="226"></th><th></th></tr></thead><tbody><tr><td>(default) <code>'algorithmic'</code> segmentation</td><td><p>Allow the synthesizer to algorithmically compute segments based on the data. You can optionally provide:</p><ul><li><code>n_segments</code>: The number of segments (defaults to 3)</li><li><code>column_names</code>: A list of column names to use for the algorithmic segmentation (defaults to all columns)</li></ul><pre class="language-python"><code class="lang-python">segmentation_params={
    'method': 'algorithmic', # required
    'n_segments': 5, # defaults to 3
    'column_names': ['age', 'income'] # defaults to all
}
</code></pre></td></tr><tr><td><code>'exact_values'</code> segmentation</td><td><p>Supply a categorical column that already contains the segments. The exact values from that column are used to identify the segments. </p><pre class="language-python"><code class="lang-python">segmentation_params={
    'method': 'exact_values', # required
    'column_name': 'made_purchase' # required
}
</code></pre></td></tr></tbody></table>
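As a rough picture of what `'exact_values'` segmentation means, the sketch below groups rows by the exact value found in one column. It uses plain Python dictionaries rather than SDV, and the helper `split_by_exact_values` is hypothetical; the synthesizer's internal logic may differ.

```python
from collections import defaultdict


def split_by_exact_values(rows, column_name):
    """Group rows into segments keyed by the exact value in `column_name`."""
    segments = defaultdict(list)
    for row in rows:
        segments[row[column_name]].append(row)
    return dict(segments)


# Toy rows standing in for a real table
rows = [
    {'age': 34, 'income': 52000, 'made_purchase': True},
    {'age': 27, 'income': 31000, 'made_purchase': False},
    {'age': 45, 'income': 88000, 'made_purchase': True},
]

segments = split_by_exact_values(rows, column_name='made_purchase')
for segment_name, segment_rows in segments.items():
    print(segment_name, len(segment_rows))
```

Each distinct value in the column becomes one segment name, which is the value you would later pass as `segment_name` when customizing a segment.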

**`per_segment_synthesizer`**: A string with the type of synthesizer to use for modeling each individual segment. *You can update individual segment synthesizers later using the `set_synthesizer_for_segment` method, detailed below.*

<table data-header-hidden><thead><tr><th width="325"></th><th></th></tr></thead><tbody><tr><td>(default) <code>'GaussianCopulaSynthesizer'</code></td><td>Use the GaussianCopulaSynthesizer to model each segment.</td></tr><tr><td><code>&#x3C;synthesizer_name></code></td><td>Supply a synthesizer name from the list of <a href="">single-table synthesizers</a>. For example, <code>'XGCSynthesizer'</code> or <code>'CTGANSynthesizer'</code>.</td></tr></tbody></table>

**`per_segment_synthesizer_params`**: A dictionary of parameters to use for each of the per segment synthesizers.

<table data-header-hidden><thead><tr><th width="267"></th><th></th></tr></thead><tbody><tr><td>(default) <code>None</code></td><td>Use the default parameters for the synthesizer</td></tr><tr><td><code>&#x3C;dictionary></code></td><td>Update the default parameters for the synthesizer you've chosen by providing a dictionary of key/value pairs, one for each parameter. The available parameters are different for each synthesizer. Refer to the <a href="">synthesizer's API</a>.<br><br>For example, for <a href="#gaussiancopulasynthesizer.load">GaussianCopulaSynthesizer</a> you can supply: <code>{'default_distribution': 'norm'}</code>.</td></tr></tbody></table>
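The parameters in this reference can be combined in a single call. The sketch below collects them in a dictionary using only the names and example values documented above; the column names `'age'` and `'income'` are placeholders for columns in your own data. The final lines are commented out because they assume the XSynthesizers Bundle is installed and a `metadata` object already exists.

```python
# All keys below are parameter names from the reference above.
synthesizer_kwargs = {
    'enforce_min_max_values': True,
    'enforce_rounding': True,
    'locales': ['en_US', 'fr_CA'],
    'segmentation_params': {
        'method': 'algorithmic',
        'n_segments': 5,
        'column_names': ['age', 'income'],  # placeholder column names
    },
    'per_segment_synthesizer': 'GaussianCopulaSynthesizer',
    'per_segment_synthesizer_params': {'default_distribution': 'norm'},
}

# With the bundle installed and a Metadata object available:
# from sdv.single_table import SegmentSynthesizer
# synthesizer = SegmentSynthesizer(metadata, **synthesizer_kwargs)
```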

### set\_synthesizer\_for\_segment

Use this function to set the algorithm to use for a specific segment of the data. This is most useful if you are using the `'exact_values'` segmentation, as you already know the segments that the synthesizer will use.

**Parameters**

* (required) `segment_name`: The exact categorical value that corresponds to the segment. This should be a value that appears in the column used for segmentation.
* (required) `synthesizer_name`: A string with the type of synthesizer to use for modeling this segment. For example, `'GaussianCopulaSynthesizer'` or `'CTGANSynthesizer'`.
* `synthesizer_params`: A dictionary of parameters to use for the synthesizer. This is different for each synthesizer. Refer to the [synthesizer's API](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers).

**Output**: None. The synthesizer corresponding to the segment is set.

```python
synthesizer.set_synthesizer_for_segment(
    segment_name=True, # everything labeled as True is one segment
    synthesizer_name='CTGANSynthesizer', # the name of any SDV single-table synthesizer
    synthesizer_params={
        'epochs': 100
    }
)
```

### get\_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

**Parameters** None

**Output** A dictionary with the parameter names and the values

```python
synthesizer.get_parameters()
```

```python
{
    'n_segments': 5,
    'per_segment_synthesizer': 'GaussianCopulaSynthesizer',
    ...
}
```

{% hint style="info" %}
The returned parameters are a copy. Changing them will not affect the synthesizer.
{% endhint %}

### get\_metadata

Use this function to access the metadata object that you have included for the synthesizer.

**Parameters** None

**Output** A [Metadata](https://docs.sdv.dev/sdv/concepts/metadata) object

```python
metadata = synthesizer.get_metadata()
```

{% hint style="info" %}
The returned metadata is a copy. Changing it will not affect the synthesizer.
{% endhint %}

## Learning from your data

To train a machine learning model on your real data, use the `fit` method.

### fit

**Parameters**

* (required) `data`: A [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object containing the real data that the machine learning model will learn from

**Output** (None)

```python
synthesizer.fit(data)
```

{% hint style="info" %}
**Technical Details:** This synthesizer uses an algorithm to segment your real data into different groups. Each group may have different patterns. This synthesizer models each segment separately by calling upon other single-table synthesizers.

Since each segment is ultimately modeled separately, the overall fit time is expected to increase linearly with the number of segments.
{% endhint %}
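The per-segment approach can be sketched with a toy model. The code below fits one simple Gaussian per segment of a single numeric column, which is why the amount of fitting work grows with the number of segments. Everything here is hypothetical: `ToyColumnModel` stands in for a real single-table synthesizer, and sampling each segment in proportion to its size is an assumption made for illustration, not documented behavior.

```python
import random
import statistics


class ToyColumnModel:
    """Hypothetical stand-in for a per-segment synthesizer (one Gaussian)."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.stdev = statistics.pstdev(values)

    def sample(self, rng):
        return rng.gauss(self.mean, self.stdev)


def fit_per_segment(segments):
    """Fit one model per segment: one fit call for each segment."""
    total = sum(len(values) for values in segments.values())
    models, weights = {}, {}
    for name, values in segments.items():
        model = ToyColumnModel()
        model.fit(values)
        models[name] = model
        weights[name] = len(values) / total  # assumed: proportional sampling
    return models, weights


def sample_rows(models, weights, num_rows, seed=0):
    """Pick a segment for each row, then sample from that segment's model."""
    rng = random.Random(seed)
    names = list(models)
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=num_rows)
    return [(name, models[name].sample(rng)) for name in chosen]


# Toy segments of one numeric column
segments = {
    'budget': [12.0, 15.0, 11.0, 14.0],
    'premium': [210.0, 190.0],
}
models, weights = fit_per_segment(segments)
synthetic = sample_rows(models, weights, num_rows=10)
```

Because each segment has its own model, patterns in one segment (such as the much higher values in `'premium'`) do not blur into the others.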

## Saving your synthesizer

Save your trained synthesizer for future use.

### save

Use this function to save your trained synthesizer as a Python pickle file.

**Parameters**

* (required) `filepath`: A string describing the filepath where you want to save your synthesizer. Make sure this ends in `.pkl`.

**Output** (None) The file will be saved at the desired location

```python
synthesizer.save(
    filepath='my_synthesizer.pkl'
)
```

### load (utility function)

Use this utility function to load a trained synthesizer from a Python pickle file. After loading your synthesizer, you'll be able to sample synthetic data from it.

**Parameters**

* (required) `filepath`: A string describing the filepath of your saved synthesizer

**Output** Your synthesizer object

```python
from sdv.utils import load_synthesizer

synthesizer = load_synthesizer(
    filepath='my_synthesizer.pkl'
)
```

*This utility function works for any SDV synthesizer.*

## What's next?

After training your synthesizer, you can now sample synthetic data. See the [Sampling](https://docs.sdv.dev/sdv/single-table-data/sampling) section for more details.

```python
synthetic_data = synthesizer.sample(num_rows=10)
```

{% hint style="info" %}
**Want to improve your synthesizer?** Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.

For more details, see [Customizations](https://docs.sdv.dev/sdv/single-table-data/modeling/customizations).
{% endhint %}

## FAQs

<details>

<summary>What happens if columns don't contain numerical data?</summary>

This synthesizer models non-numerical columns, including columns with missing values.

Most algorithms that you can use for the per-segment modeling are designed for numerical data. Before modeling, this synthesizer uses Reversible Data Transformers (RDTs) to convert every segment to numerical data.

*Currently, it is not possible to access or modify these transformations, but this feature is coming soon!*
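To show what "reversible" means here, the sketch below encodes a categorical column (including a missing value) as integers and then restores the original values. `ToyCategoryEncoder` is a hypothetical class written for this illustration; the actual RDT transformers work differently and are more sophisticated.

```python
class ToyCategoryEncoder:
    """Hypothetical reversible encoder: category -> integer code and back."""

    def fit(self, values):
        categories = sorted({v for v in values if v is not None})
        self.to_code = {cat: code for code, cat in enumerate(categories)}
        # Treat a missing value (None) as its own category.
        if any(v is None for v in values):
            self.to_code[None] = len(categories)
        self.to_category = {code: cat for cat, code in self.to_code.items()}

    def transform(self, values):
        return [self.to_code[v] for v in values]

    def reverse_transform(self, codes):
        return [self.to_category[c] for c in codes]


column = ['gold', 'silver', None, 'gold']

encoder = ToyCategoryEncoder()
encoder.fit(column)
numeric = encoder.transform(column)            # numerical form for modeling
restored = encoder.reverse_transform(numeric)  # back to the original values
```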

</details>

<details>

<summary>Can I call <code>fit</code> again even if I've previously fit some data?</summary>

Yes, even if you've previously fit data, you can call the `fit` method again.

If you do this, the synthesizer will **start over from scratch** and fit the new data that you provide it. This is the equivalent of creating a new synthesizer and fitting it with new data.

</details>
