Synthetic Data Vault
Copyright (c) 2023, DataCebo, Inc.

❖ BootstrapSynthesizer

Last updated 1 day ago

The BootstrapSynthesizer is designed for cases where you have only a few rows of data, or where your data is "short and wide", containing more columns than rows. This synthesizer internally bootstraps your real data, and then uses the bootstrapped data to build a model. The modeling part is compatible with any other single-table synthesizer.

from sdv.single_table import BootstrapSynthesizer

synthesizer = BootstrapSynthesizer(metadata)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)
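As a rough aid for deciding whether your data fits the "short and wide" description above, the check below compares the column count to the row count. It is a minimal sketch: the helper name `is_short_and_wide` and the list-of-lists layout are assumptions made for illustration and are not part of SDV.

```python
def is_short_and_wide(rows):
    """Return True when a table has more columns than rows.

    `rows` is a list of equal-length lists (a hypothetical stand-in for a table).
    """
    num_rows = len(rows)
    num_columns = len(rows[0]) if rows else 0
    return num_columns > num_rows


# Three rows with five columns each: more columns than rows.
small_table = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.5, 2.5, 3.5, 4.5, 5.5],
    [0.9, 1.9, 2.9, 3.9, 4.9],
]
print(is_short_and_wide(small_table))  # prints True
```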

Creating a synthesizer

When creating your synthesizer, you are required to pass in a Metadata object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.

synthesizer = BootstrapSynthesizer(
    metadata, # required
    num_rows_bootstrap=1000,
    bootstrap_noise_amt=1.5,
    data_synthesizer='GaussianCopulaSynthesizer',
    enforce_min_max_values=True,
    synthesize_missing_values=False
)

Parameter Reference

num_rows_bootstrap : Specify the number of additional rows to bootstrap before modeling the data.

(default) 1000

Bootstrap the original data by creating 1000 rows of additional data

<integer>

Create the desired number of bootstrapped rows before building the model

bootstrap_noise_amt : The amount of noise to add when bootstrapping the data. Some noise is necessary to provide a greater diversity of data points for modeling.

(default) 1.5

When bootstrapping the data, add noise equal to 1.5x the standard deviation of each column.

<float>

Add the desired amount of noise to the bootstrapped data. This is the multiplier to the standard deviation, so 1.5 means 1.5x the standard deviation, 2 means 2x the standard deviation, etc.

data_synthesizer : The single-table synthesizer to use when modeling the bootstrapped data.

(default) 'GaussianCopulaSynthesizer'

<synthesizer_name>

data_synthesizer_params : A dictionary of parameters to use for the synthesizer

(default) None

Use the default parameters for the synthesizer

<dictionary>

enforce_min_max_values: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data

(default) True

The synthetic data will contain numerical values that are within the ranges of the real data.

False

The synthetic data may contain numerical values that are less than or greater than the real data.
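One way to picture what staying within the real data's min/max boundaries means is the clipping sketch below. It only illustrates the effect on the output values. How SDV enforces the bounds internally is not described here, so clipping is an assumption, and the helper name `clip_to_real_range` is hypothetical.

```python
def clip_to_real_range(real_values, synthetic_values):
    """Force each synthetic value into the [min, max] range of the real values."""
    low, high = min(real_values), max(real_values)
    return [min(max(value, low), high) for value in synthetic_values]


real_ages = [23, 35, 41, 58]
raw_synthetic_ages = [19.5, 30.2, 61.7, 44.0]

# 19.5 is raised to the real minimum (23) and 61.7 is lowered to the real maximum (58).
print(clip_to_real_range(real_ages, raw_synthetic_ages))  # prints [23, 30.2, 58, 44.0]
```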

synthesize_missing_values: Control whether the synthetic data should include missing values.

(default) True

The synthetic data will contain missing values in roughly the same proportion as the original data

False

The synthetic data should not contain any missing values for numerical and datetime columns.
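To make "roughly the same proportion" concrete, the sketch below measures the share of missing values in a real column and blanks out about the same share of a synthetic column. It is an illustration under stated assumptions: missing values are represented as `None`, and both helper names are hypothetical rather than SDV functions.

```python
import random


def missing_proportion(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(value is None for value in values) / len(values)


def reinsert_missing(synthetic_values, proportion, seed=0):
    """Replace each value with None with probability `proportion`."""
    rng = random.Random(seed)
    return [None if rng.random() < proportion else value for value in synthetic_values]


real_column = [4.2, None, 3.9, 5.1, None, 4.8, 4.0, 4.4]
proportion = missing_proportion(real_column)  # 2 of the 8 values are missing
synthetic_column = reinsert_missing([4.1] * 1000, proportion)

print(proportion)  # prints 0.25
```

Because each value is blanked out independently, the synthetic proportion lands near the real one without matching it exactly.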

get_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

Parameters None

Output A dictionary with the parameter names and the values

synthesizer.get_parameters()
{
    'num_rows_bootstrap': 1000,
    'bootstrap_noise_amt': 1.5,
    'data_synthesizer': 'GaussianCopulaSynthesizer',
    'enforce_min_max_values': True,
    'synthesize_missing_values': False
}

The returned parameters are a copy. Changing them will not affect the synthesizer.
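The copy behavior can be pictured with a small stand-in class. This is a sketch only: `ToySynthesizer` is hypothetical and is not the SDV implementation; it simply returns a deep copy of its stored parameters so that edits to the returned dictionary do not reach the object.

```python
import copy


class ToySynthesizer:
    """Hypothetical stand-in used to illustrate copy semantics."""

    def __init__(self, **parameters):
        self._parameters = parameters

    def get_parameters(self):
        # Hand back a deep copy so callers cannot mutate internal state.
        return copy.deepcopy(self._parameters)


toy = ToySynthesizer(num_rows_bootstrap=1000, enforce_min_max_values=True)

params = toy.get_parameters()
params['num_rows_bootstrap'] = 5  # changes only the returned copy

print(toy.get_parameters()['num_rows_bootstrap'])  # prints 1000
```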

get_metadata

Use this function to access the metadata object that you have included for the synthesizer

Parameters None

metadata = synthesizer.get_metadata()

The returned metadata is a copy. Changing it will not affect the synthesizer.

Learning from your data

To train a machine learning model on your real data, use the fit method.

fit

Parameters

Output (None)

synthesizer.fit(data)

Saving your synthesizer

Save your trained synthesizer for future use.

save

Use this function to save your trained synthesizer as a Python pickle file.

Parameters

  • (required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl

Output (None) The file will be saved at the desired location

synthesizer.save(
    filepath='my_synthesizer.pkl'
)

BootstrapSynthesizer.load

Use this function to load a trained synthesizer from a Python pickle file

Parameters

  • (required) filepath: A string describing the filepath of your saved synthesizer

Output Your synthesizer, as a BootstrapSynthesizer object

from sdv.single_table import BootstrapSynthesizer

synthesizer = BootstrapSynthesizer.load(
    filepath='my_synthesizer.pkl'
)
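Since save and load work with Python pickle files, the round trip can be sketched with the standard `pickle` module. The example below is illustrative: it pickles a plain dictionary standing in for a trained synthesizer, and the helper names `save_object` and `load_object` are hypothetical. It does not show how SDV serializes its own objects.

```python
import os
import pickle
import tempfile


def save_object(obj, filepath):
    """Write any picklable object to `filepath`."""
    with open(filepath, 'wb') as handle:
        pickle.dump(obj, handle)


def load_object(filepath):
    """Read back an object previously written with pickle."""
    with open(filepath, 'rb') as handle:
        return pickle.load(handle)


# A dictionary standing in for a trained synthesizer's state.
toy_state = {'num_rows_bootstrap': 1000, 'data_synthesizer': 'GaussianCopulaSynthesizer'}

with tempfile.TemporaryDirectory() as folder:
    filepath = os.path.join(folder, 'my_synthesizer.pkl')
    save_object(toy_state, filepath)
    restored_state = load_object(filepath)

print(restored_state == toy_state)  # prints True
```

Unpickling can execute arbitrary code, so only load .pkl files that come from a source you trust.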

What's next?

Want to improve your synthesizer? Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.

FAQ

What does it mean to bootstrap data?

Bootstrapping data means creating more examples of training data using your original rows. Bootstrapping involves duplicating the original rows, and then adding some noise to the values in order to create a larger, more varied dataset.

This is necessary because the AI-based models expect a larger number of data points for accurate learning.
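The idea of duplicating rows and adding noise can be sketched in a few lines of plain Python. This is a minimal illustration, not the BootstrapSynthesizer's actual procedure: it assumes purely numerical rows, draws Gaussian noise, and scales that noise by a multiplier times each column's standard deviation. The function name `bootstrap_rows` and its arguments are hypothetical.

```python
import random
import statistics


def bootstrap_rows(rows, num_rows_bootstrap, noise_amount, seed=0):
    """Resample rows with replacement and add Gaussian noise to every value.

    `rows` is a list of equal-length lists of floats (at least two rows).
    The noise scale is `noise_amount` times the column's standard deviation.
    """
    rng = random.Random(seed)
    columns = list(zip(*rows))
    column_stds = [statistics.stdev(column) for column in columns]

    bootstrapped = []
    for _ in range(num_rows_bootstrap):
        base_row = rng.choice(rows)  # duplicate one of the original rows
        noisy_row = [
            value + rng.gauss(0.0, noise_amount * std)
            for value, std in zip(base_row, column_stds)
        ]
        bootstrapped.append(noisy_row)
    return bootstrapped


real_rows = [
    [1.0, 10.0, 100.0],
    [2.0, 12.0, 90.0],
    [3.0, 11.0, 110.0],
]
extra_rows = bootstrap_rows(real_rows, num_rows_bootstrap=1000, noise_amount=1.5)

print(len(extra_rows), len(extra_rows[0]))  # prints 1000 3
```

With a noise multiplier of 0, every bootstrapped row is an exact duplicate of an original row, which is why some noise is needed to give the model varied data points.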

data_synthesizer, (default) 'GaussianCopulaSynthesizer': Use the GaussianCopulaSynthesizer to build a model of the bootstrapped data.

data_synthesizer, <synthesizer_name>: Supply a synthesizer name from the list of single table synthesizers. For example, 'XGCSynthesizer' or 'CTGANSynthesizer'.

data_synthesizer_params, <dictionary>: Update the default parameters for the synthesizer you've chosen by providing a dictionary of key/value pairs for each parameter. Refer to the docs for your synthesizer for possible parameters. For example, for GaussianCopulaSynthesizer you can supply: {'default_distribution': 'norm'}.

get_metadata, Output: A Metadata object

fit, Parameters: (required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from

Technical Details: This synthesizer internally bootstraps your real data by adding noise, and then uses the bootstrapped data to build a model. The modeling part is compatible with any other single-table synthesizer.

After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.

For more details on constraints and preprocessing, see Customizations.

❖ SDV Enterprise Bundle. This feature is available as part of the XSynthesizers Bundle, an optional add-on to SDV Enterprise. For more information, please visit the XSynthesizers Bundle page.