Synthetic Data Vault
Copyright (c) 2023, DataCebo, Inc.

❖ SegmentSynthesizer


Last updated 21 days ago

The SegmentSynthesizer divides the real data into segments and fits a separate model to each one. You can supply any single-table synthesizer for the per-segment model. Use this when your real data is highly segmented, with different patterns in each segment.

from sdv.single_table import SegmentSynthesizer

synthesizer = SegmentSynthesizer(metadata)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)

Creating a synthesizer

When creating your synthesizer, you are required to pass in a Metadata object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.

synthesizer = SegmentSynthesizer(
    metadata, # required
    n_segments=3,
    columns_for_segmentation=['age', 'income'],
    per_segment_synthesizer='GaussianCopulaSynthesizer'
)

Parameter Reference

enforce_min_max_values: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data

(default) True

The synthetic data will contain numerical values that are within the ranges of the real data.

False

enforce_rounding: Control whether the synthetic data should have the same number of decimal digits as the real data

(default) True

The synthetic data will be rounded to the same number of decimal digits that were observed in the real data

False

The synthetic data may contain more decimal digits than were observed in the real data

locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.

(default) ['en_US']

Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)

<list>

Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.
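The locale format described above can be checked before a list is passed to the synthesizer. The following is a minimal sketch: is_valid_locale_format is a hypothetical helper, not part of SDV, and the lowercase-language, uppercase-country convention is an assumption based on the 'en_US' default. It checks only the shape of the string, not whether the locale is actually supported.

```python
import re

# Hypothetical helper (not part of SDV). It checks only the "xx_YY" shape:
# a 2-letter language code, an underscore, and a 2-letter country code.
# Assumption: language is lowercase and country is uppercase, as in 'en_US'.
_LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def is_valid_locale_format(locale: str) -> bool:
    """Return True if the string has the 2-letter_2-letter locale shape."""
    return _LOCALE_PATTERN.fullmatch(locale) is not None


locales = ["en_US", "fr_CA"]

# Collect any entries that do not match the expected shape.
badly_formed = [loc for loc in locales if not is_valid_locale_format(loc)]
```

An empty badly_formed list means every entry has the expected shape; a string such as 'en-US' (hyphen instead of underscore) would be flagged.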

n_segments: The number of segments to compute. The synthesizer automatically computes the segments based on the data patterns. In some cases, it may determine that the data requires fewer segments than specified, so n_segments acts as a maximum.

(default) 5

Break up the real data into 5 segments

<int>

Break up the data into the provided number of segments

columns_for_segmentation: A list of column names that should be used to compute the segments. The column names should be listed in the metadata, and contain statistical information (i.e. contain data that is numerical, datetime, categorical, or boolean).

(default) None

Use all the statistical columns in the data to create segments

<list>

Use only the column names provided to create segments

per_segment_synthesizer: A string with the type of synthesizer to use for modeling each individual segment

(default) 'GaussianCopulaSynthesizer'

Use the GaussianCopulaSynthesizer to model each segment.

<synthesizer_name>

per_segment_synthesizer_params: A dictionary of parameters to use for each of the per segment synthesizers.

(default) None

Use the default parameters for the synthesizer

<dictionary>

get_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

Parameters None

Output A dictionary with the parameter names and the values

synthesizer.get_parameters()
{
    'n_segments': 5,
    'per_segment_synthesizer': 'GaussianCopulaSynthesizer',
    ...
}

The returned parameters are a copy. Changing them will not affect the synthesizer.
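The copy behavior can be illustrated with a small stand-in class. This is only a sketch of the general pattern: StandInSynthesizer is hypothetical, and returning a deep copy is an assumption about how such behavior is typically achieved, not a description of SDV's internals.

```python
import copy


class StandInSynthesizer:
    """Hypothetical stand-in used only to illustrate copy semantics."""

    def __init__(self, n_segments=5):
        self._params = {"n_segments": n_segments}

    def get_parameters(self):
        # Return a deep copy so callers cannot mutate the internal state.
        return copy.deepcopy(self._params)


synthesizer = StandInSynthesizer()

params = synthesizer.get_parameters()
params["n_segments"] = 99  # edits the returned copy only

# The synthesizer still reports its original value.
unchanged = synthesizer.get_parameters()["n_segments"]
```

Here unchanged still holds the original value of 5, even though the returned dictionary was edited.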

get_metadata

Use this function to access the metadata object that you have included for the synthesizer

Parameters None

metadata = synthesizer.get_metadata()

The returned metadata is a copy. Changing it will not affect the synthesizer.

Learning from your data

To learn a machine learning model based on your real data, use the fit method.

fit

Parameters

Output (None)

synthesizer.fit(data)

Technical Details: This synthesizer uses an algorithm to segment your real data into different groups. Each group may have different patterns. This synthesizer models each segment separately by calling upon other single-table synthesizers.

Since each segment is ultimately modeled separately, the overall fit time is expected to increase linearly with the number of segments.
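The segment-then-model idea can be sketched in plain Python. This page does not say which segmentation algorithm the synthesizer uses, so the toy 1-D k-means and the per-segment Gaussian below are stand-ins chosen purely for illustration; all function names are hypothetical. The sketch also shows the requested number of segments acting as a maximum, and one model being fit per segment.

```python
import random
import statistics


def segment_values(values, n_segments, n_iter=20):
    """Toy 1-D k-means (a stand-in, not SDV's algorithm)."""
    distinct = sorted(set(values))
    # There cannot be more segments than distinct values.
    k = min(n_segments, len(distinct))
    # Spread the initial centers evenly across the sorted distinct values.
    centers = [distinct[i * (len(distinct) - 1) // max(k - 1, 1)] for i in range(k)]
    groups = [list(values)]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        groups = [g for g in groups if g]  # drop segments that became empty
        centers = [statistics.fmean(g) for g in groups]
    return groups


def fit_segments(values, n_segments):
    """Fit one simple Gaussian model per segment."""
    return [
        {
            "weight": len(seg) / len(values),
            "mean": statistics.fmean(seg),
            "std": statistics.pstdev(seg),
        }
        for seg in segment_values(values, n_segments)
    ]


def sample_segments(models, num_rows, rng):
    """Pick a segment in proportion to its size, then sample from its model."""
    weights = [m["weight"] for m in models]
    chosen = rng.choices(models, weights=weights, k=num_rows)
    return [rng.gauss(m["mean"], m["std"]) for m in chosen]


rng = random.Random(0)
# Toy "real" data with two clearly separated groups.
real = [rng.gauss(20, 1) for _ in range(200)] + [rng.gauss(80, 1) for _ in range(200)]

models = fit_segments(real, n_segments=2)
synthetic = sample_segments(models, num_rows=10, rng=rng)
```

Because fit_segments builds one model per segment, the modeling work in this sketch grows with the number of segments, which mirrors the fit-time expectation stated above.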

Saving your synthesizer

Save your trained synthesizer for future use.

save

Use this function to save your trained synthesizer as a Python pickle file.

Parameters

  • (required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl

Output (None) The file will be saved at the desired location

synthesizer.save(
    filepath='my_synthesizer.pkl'
)

SegmentSynthesizer.load

Use this function to load a trained synthesizer from a Python pickle file

Parameters

  • (required) filepath: A string describing the filepath of your saved synthesizer

Output Your synthesizer, as a SegmentSynthesizer object

from sdv.single_table import SegmentSynthesizer

synthesizer = SegmentSynthesizer.load(
    filepath='my_synthesizer.pkl'
)
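The save and load round trip can be illustrated with Python's pickle module. This is a minimal sketch with a hypothetical StandInSynthesizer class; it pickles a plain state dictionary for simplicity and does not reflect how SDV serializes its synthesizers internally.

```python
import os
import pickle
import tempfile


class StandInSynthesizer:
    """Hypothetical stand-in; not the SDV implementation."""

    def __init__(self, n_segments=5):
        self.n_segments = n_segments

    def save(self, filepath):
        # Write the object's state to a pickle file.
        with open(filepath, "wb") as f:
            pickle.dump({"n_segments": self.n_segments}, f)

    @classmethod
    def load(cls, filepath):
        # Read the state back and rebuild an equivalent object.
        with open(filepath, "rb") as f:
            state = pickle.load(f)
        return cls(**state)


# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_synthesizer.pkl")
    StandInSynthesizer(n_segments=3).save(path)
    restored = StandInSynthesizer.load(path)
```

Loading a pickle file can execute arbitrary code, so only load files that come from a source you trust.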

What's next?

Want to improve your synthesizer? Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.

FAQs

What happens if columns don't contain numerical data?

This synthesizer models non-numerical columns, including columns with missing values.

Most algorithms that you can use for the per-segment modeling are designed for numerical data. This synthesizer ensures that all segments are appropriately converted to numerical data before modeling using Reversible Data Transformers (RDTs).

Currently, it is not possible to access and modify these transformations, though this feature is coming soon!

Can I call fit again even if I've previously fit some data?

Yes, even if you've previously fit data, you can call the fit method again.

If you do this, the synthesizer will start over from scratch and fit the new data that you provide it. This is the equivalent of creating a new synthesizer and fitting it with new data.

enforce_min_max_values, False: The synthetic data may contain numerical values that are less than or greater than the real data. Note that you can still set the limits on individual columns using Constraints.

locales, <list>: For example ["en_US", "fr_CA"]. For all options, see the Faker docs.

per_segment_synthesizer, <synthesizer_name>: Supply a synthesizer name from the list of single table synthesizers. For example 'XGCSynthesizer' or 'CTGANSynthesizer'.

per_segment_synthesizer_params, <dictionary>: Update the default parameters for the synthesizer you've chosen by providing a dictionary of key/value pairs for each parameter. Refer to the docs for your synthesizer for possible parameters. For example, for GaussianCopulaSynthesizer you can supply: {'default_distribution': 'norm'}.

get_metadata, Output: A Metadata object

fit, Parameters: (required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from

After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.

For more details, see Customizations.

❖ SDV Enterprise Bundle. This feature is available as part of the XSynthesizers Bundle, an optional add-on to SDV Enterprise. For more information, please visit the XSynthesizers Bundle page.