Synthetic Data Vault

TVAESynthesizer

Copyright (c) 2023, DataCebo, Inc.

The TVAE Synthesizer uses variational autoencoder (VAE)-based neural network techniques to train a model and generate synthetic data.

from sdv.single_table import TVAESynthesizer

synthesizer = TVAESynthesizer(metadata)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)

Creating a synthesizer

When creating your synthesizer, you are required to pass in a Metadata object as the first argument. All other parameters are optional. You can include them to customize the synthesizer.

synthesizer = TVAESynthesizer(
    metadata, # required
    enforce_min_max_values=True,
    enforce_rounding=False,
    epochs=500
)

Parameter Reference

enforce_min_max_values: Control whether the synthetic data should adhere to the same min/max boundaries set by the real data

(default) True

The synthetic data will contain numerical values that are within the ranges of the real data.

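False

The synthetic data may contain numerical values that are less than or greater than the real data. Note that you can still set the limits on individual columns using Constraints.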

enforce_rounding: Control whether the synthetic data should have the same number of decimal digits as the real data

(default) True

The synthetic data will be rounded to the same number of decimal digits that were observed in the real data

False

The synthetic data may contain more decimal digits than were observed in the real data

locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.

(default) ['en_US']

Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)

<list>

Create data from the list of locales. Each locale string consists of a 2-character code for the language and a 2-character code for the country, separated by an underscore. For example, ["en_US", "fr_CA"]. For all options, see the Faker docs.
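
The locale format described above can be checked before the list is handed to the synthesizer. The following is a minimal sketch, not part of the SDV API: check_locales and LOCALE_PATTERN are hypothetical names, and the pattern assumes only the "2 characters, underscore, 2 characters" shape described here. The Faker docs remain the authoritative list of valid locales.

```python
import re

# Assumption: the shape described above, i.e. a 2-letter language code, an
# underscore, and a 2-letter country code (for example "en_US").
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def check_locales(locales):
    """Return the locales as a list if every entry looks like 'xx_YY'."""
    bad = [loc for loc in locales if not LOCALE_PATTERN.fullmatch(str(loc))]
    if bad:
        raise ValueError(f"Unexpected locale format: {bad}")
    return list(locales)


locales = check_locales(["en_US", "fr_CA"])
print(locales)

# The checked list could then be passed to the synthesizer, e.g.:
# synthesizer = TVAESynthesizer(metadata, locales=locales)
```

A check like this only catches formatting mistakes such as "english" or "en-US". It does not confirm that a given locale is actually supported.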

epochs: Number of times to train the VAE. Each new epoch can improve the model.

(default) 300

Run all the data through the VAE 300 times during training

<number>

Train for a different number of epochs. Note that larger numbers will increase the modeling time.

verbose: Control whether to print out the results of each epoch. You can use this to track the training time as well as the improvements per epoch.

(default) False

Do not print out any results

True

Print out the loss values per epoch. The loss values indicate how well the VAE is currently performing; lower values indicate higher quality.

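cuda: Whether to use CUDA, a parallel computing platform that allows you to speed up modeling time using the GPU

(default) True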

If available, use CUDA to speed up modeling time. If it's not available, then there will be no difference.

False

Do not use CUDA to speed up modeling time.

Looking for more customizations? Other settings are available to fine-tune the architecture of the neural network used to model the data.

These settings are specific to the neural network. Use these settings if you want to optimize the technical architecture and modeling.

batch_size: Number of data samples to process in each step.

compress_dims: Size of each hidden layer in the encoder. Defaults to (128, 128).

decompress_dims: Size of each hidden layer in the decoder. Defaults to (128, 128).

embedding_dim: Size of the embedding dimension used by the encoder and decoder. Defaults to 128.

l2scale: Regularization term. Defaults to 1e-5.

loss_factor: Multiplier for the reconstruction error. Defaults to 2.

get_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

Parameters None

Output A dictionary with the parameter names and the values

synthesizer.get_parameters()
{
    'enforce_rounding': False,
    'epochs': 500,
    ...
}

The returned parameters are a copy. Changing them will not affect the synthesizer.
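
Because the returned dictionary mixes provided and default values, it can help to see which entries differ from the documented defaults. Here is a small sketch under stated assumptions: overridden_parameters and DOCUMENTED_DEFAULTS are hypothetical helpers (not SDV functions), the defaults are copied from the Parameter Reference on this page, and the params dictionary is a stand-in for what get_parameters returns.

```python
# Defaults as listed in the Parameter Reference on this page.
DOCUMENTED_DEFAULTS = {
    "enforce_min_max_values": True,
    "enforce_rounding": True,
    "locales": ["en_US"],
    "epochs": 300,
    "verbose": False,
    "cuda": True,
}


def overridden_parameters(params, defaults=DOCUMENTED_DEFAULTS):
    """Return the entries of params whose value differs from the documented default."""
    return {
        name: value
        for name, value in params.items()
        if name in defaults and defaults[name] != value
    }


# Stand-in for the dictionary returned by synthesizer.get_parameters()
params = {"enforce_min_max_values": True, "enforce_rounding": False, "epochs": 500}
print(overridden_parameters(params))
```

Parameters that are not in the defaults table are ignored rather than reported, since this sketch has no documented default to compare them against.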

get_metadata

Use this function to access the metadata object that you have included for the synthesizer.

Parameters None

Output A Metadata object

metadata = synthesizer.get_metadata()

The returned metadata is a copy. Changing it will not affect the synthesizer.

Learning from your data

To learn a machine learning model based on your real data, use the fit method.

fit

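Parameters

  • (required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from

Output (None)

Technical Details: This synthesizer uses the TVAE to learn a model from real data and create synthetic data. The TVAE uses variational autoencoders (VAEs) to model data, as described in the paper Modeling Tabular data using Conditional GAN, which was presented at the NeurIPS conference in 2019.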

get_loss_values

After fitting, you can access the loss values computed during each epoch and batch.

Parameters (None)

Output A pandas.DataFrame object containing epoch number, batch number and loss value.

synthesizer.get_loss_values()
Epoch     Batch    Loss 
1         1        1.7863
1         2        1.5484
1         3        1.3633
...
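
The loss table has one row per batch, so a per-epoch summary is often easier to read. The sketch below averages the batch losses within each epoch. mean_loss_per_epoch is a hypothetical helper, and it takes plain (epoch, batch, loss) tuples so it runs without a trained synthesizer. The column names in the commented lines are assumed to match the sample output shown above.

```python
from collections import defaultdict


def mean_loss_per_epoch(rows):
    """Average the batch losses within each epoch.

    rows is an iterable of (epoch, batch, loss) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for epoch, _batch, loss in rows:
        totals[epoch] += loss
        counts[epoch] += 1
    return {epoch: totals[epoch] / counts[epoch] for epoch in sorted(totals)}


# Rows copied from the sample output above (epoch 1 only).
sample_rows = [(1, 1, 1.7863), (1, 2, 1.5484), (1, 3, 1.3633)]
epoch_means = mean_loss_per_epoch(sample_rows)
print(epoch_means)

# With a real synthesizer, the rows could come from the DataFrame, e.g.:
# loss_values = synthesizer.get_loss_values()
# rows = loss_values[["Epoch", "Batch", "Loss"]].itertuples(index=False, name=None)
```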

Saving your synthesizer

Save your trained synthesizer for future use.

save

Use this function to save your trained synthesizer as a Python pickle file.

Parameters

  • (required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl

Output (None) The file will be saved at the desired location

synthesizer.save(
    filepath='my_synthesizer.pkl'
)
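
Since the filepath should end in .pkl, a quick check before saving can catch a mistyped extension. This is a minimal sketch: pkl_path is a hypothetical helper, not an SDV function.

```python
from pathlib import Path


def pkl_path(filepath):
    """Return filepath as a Path, raising if it does not end in .pkl."""
    path = Path(filepath)
    if path.suffix != ".pkl":
        raise ValueError(f"Expected a .pkl file, got: {path.name}")
    return path


path = pkl_path("my_synthesizer.pkl")
print(path.name)

# The checked path could then be used for saving, e.g.:
# synthesizer.save(filepath=str(path))
```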

TVAESynthesizer.load

Use this function to load a trained synthesizer from a Python pickle file

Parameters

  • (required) filepath: A string describing the filepath of your saved synthesizer

Output Your synthesizer, as a TVAESynthesizer object

from sdv.single_table import TVAESynthesizer

synthesizer = TVAESynthesizer.load(
    filepath='my_synthesizer.pkl'
)

What's next?

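After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.

Want to improve your synthesizer? Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.

For more details, see Customizations.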

FAQs

What happens if columns don't contain numerical data?

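This synthesizer models non-numerical columns, including columns with missing values.

Although the TVAE algorithm is designed for complete data with non-missing values, this synthesizer converts other data types using Reversible Data Transforms (RDTs). To access and modify the transformations, see Advanced Features.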

How many epochs should I train for?

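Unfortunately, there is no one-size-fits-all solution for this question! The optimal number of epochs depends on both the complexity of your dataset and the metrics you are using to quantify success.

Our experiments suggest that increasing the number of epochs helps up until a certain inflection point. After this, there is no significant improvement. Keep in mind that increasing the epochs also increases the training time. More information is available in this discussion.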
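
One rough way to choose a number of epochs is to look for the point where the per-epoch loss stops improving. The sketch below is only an illustration. first_plateau_epoch is a hypothetical helper, the min_improvement and patience thresholds are arbitrary assumptions, and the loss values are made up. Loss is also only one signal; the metrics you use to quantify success may point to a different number.

```python
def first_plateau_epoch(epoch_losses, min_improvement=0.01, patience=3):
    """Return the last epoch (1-based) that gave a meaningful improvement.

    An epoch counts as improving when it lowers the best loss seen so far by
    more than min_improvement. Once `patience` consecutive epochs fail to
    improve, the last improving epoch is returned. Returns None if the loss
    never plateaus within the given values.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(epoch_losses, start=1):
        if best - loss > min_improvement:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch - patience
    return None


# Made-up per-epoch losses, for illustration only.
losses = [1.50, 1.20, 1.00, 0.90, 0.895, 0.893, 0.894, 0.892]
print(first_plateau_epoch(losses))
```

With these made-up values, the improvements after epoch 4 are all smaller than the threshold, so epoch 4 is reported as the last one that helped.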

Can I call fit again even if I've previously fit some data?

Yes, even if you've previously fit data, you can call the fit method again.

If you do this, the synthesizer will start over from scratch and fit the new data that you provide it. This is the equivalent of creating a new synthesizer and fitting it with new data.

How do I cite TVAE?

The TVAE model was introduced in the same paper as CTGAN.

Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.

@inproceedings{tvae,
   title={Modeling Tabular data using Conditional GAN},
   author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
   booktitle={Advances in Neural Information Processing Systems},
   year={2019}
}
