HMASynthesizer

The HMA Synthesizer uses a hierarchical ML algorithm to learn from real data and generate synthetic data. The algorithm is based on classical statistics.
from sdv.multi_table import HMASynthesizer
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample()
Is the HMASynthesizer suited for your dataset? The HMASynthesizer is designed to capture correlations between different tables with high quality. The algorithm is optimized for datasets with around 5 tables and 1 level of depth (e.g. a parent and its child table). Modeling time is likely to increase if you have multiple levels of tables or more columns.
Want to model more complex graphs? If you are looking for solutions with a larger schema, with many tables and complex relational structure, please contact us at [email protected].
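To check a schema against the guidance above, one option is to count its tables and measure the depth of its parent-child chains. The sketch below is a minimal illustration, assuming the relationships are available as (parent, child) name pairs. The helper `schema_depth` and the table names are hypothetical and are not part of SDV.

```python
from collections import defaultdict


def schema_depth(relationships):
    """Return the number of parent-child levels in the longest chain.

    `relationships` is a list of (parent_table, child_table) name pairs.
    This is a hypothetical helper, not an SDV function.
    """
    children = defaultdict(list)
    for parent, child in relationships:
        children[parent].append(child)

    def depth_below(table, seen):
        # Guard against cyclic schemas, which have no well-defined depth.
        if table in seen:
            raise ValueError(f"cyclic relationship involving {table!r}")
        seen = seen | {table}
        return max(
            (1 + depth_below(child, seen) for child in children.get(table, [])),
            default=0,
        )

    tables = {name for pair in relationships for name in pair}
    return max((depth_below(name, frozenset()) for name in tables), default=0)


# One parent and one child table: 1 level of depth.
one_level = [("hotels", "guests")]
# A grandchild table adds a second level.
two_levels = [("hotels", "guests"), ("guests", "payments")]

depth_one = schema_depth(one_level)
depth_two = schema_depth(two_levels)
table_count = len({name for pair in two_levels for name in pair})
```

Here `depth_one` is 1 and `depth_two` is 2, so the second schema goes beyond the 1 level of depth that the algorithm is optimized for.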

Creating a synthesizer

When creating your synthesizer, you are required to pass in a Multi Table Metadata object as the first argument.
synthesizer = HMASynthesizer(metadata)
All other parameters are optional. You can include them to customize the synthesizer.

Parameter Reference

locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.
(default) ['en_US']
Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)
<list>
Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.
For example ["en_US", "fr_CA"].
For all options, see the Faker docs.
synthesizer = HMASynthesizer(
    metadata,
    locales=['en_US', 'en_CA', 'fr_CA']
)
verbose: A boolean describing whether or not to show the progress when fitting the synthesizer.
(default) True
Show the progress when fitting the synthesizer. You'll see printed progress bars during every stage of the fitting process: preprocessing, learning relationships and modeling tables.
False
Do not show progress. The synthesizer will fit the data silently.
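The locale format described above (a 2-character language code, an underscore, and a 2-character country code) can be checked before the list is passed to the synthesizer. This is a minimal sketch: the helper `check_locales` is hypothetical and not part of SDV, and it only checks the format, not whether Faker supports a given locale.

```python
import re

# Two lowercase letters, an underscore, two uppercase letters (e.g. "fr_CA").
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def check_locales(locales):
    """Return the locales as a list, or raise if any has an unexpected format.

    Hypothetical helper: it validates the string format only.
    """
    bad = [locale for locale in locales if not LOCALE_PATTERN.fullmatch(locale)]
    if bad:
        raise ValueError(f"Unexpected locale format: {bad}")
    return list(locales)


locales = check_locales(["en_US", "fr_CA"])

try:
    check_locales(["en-US"])  # a hyphen instead of an underscore
    rejected = False
except ValueError:
    rejected = True
```

The validated list can then be passed as the `locales` parameter shown earlier.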

set_table_parameters

The HMA Synthesizer is a multi-table algorithm that models each individual table as well as the connections between them. You can get and set the parameters for each individual table.
Parameters
  • (required) table_name: A string describing the name of the table
  • table_synthesizer: The single table synthesizer to use for modeling the table
    • (default) 'GaussianCopulaSynthesizer': Use the GaussianCopulaSynthesizer to model the single table
    • No other options are available for the HMASynthesizer
  • table_parameters: A dictionary mapping the name of the parameter (string) to the value of the parameter (various). See GaussianCopulaSynthesizer for more details.
Output (None)
synthesizer.set_table_parameters(
    table_name='guests',
    table_synthesizer='GaussianCopulaSynthesizer',
    table_parameters={
        'enforce_min_max_values': True,
        'default_distribution': 'truncnorm',
        'numerical_distributions': {
            'checkin_date': 'uniform',
            'amenities_fee': 'beta'
        }
    }
)

get_table_parameters

Use this function to access the custom parameters you have included for the synthesizer.
Parameters
  • (required) table_name: A string describing the name of the table
Output A dictionary with the parameter names and the values
synthesizer.get_table_parameters(table_name='guests')
{
    'enforce_min_max_values': True,
    'default_distribution': 'truncnorm',
    'numerical_distributions': {
        'checkin_date': 'uniform',
        'amenities_fee': 'beta'
    }
}
The returned parameters are a copy. Changing them will not affect the synthesizer.
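The copy behavior can be pictured with a small stand-in class. This is only a sketch of the idea: `ParameterStore` is hypothetical, is not part of SDV, and makes no claim about how the synthesizer stores its parameters internally.

```python
import copy


class ParameterStore:
    """Hypothetical stand-in that mimics the copy-on-read behavior."""

    def __init__(self):
        self._table_parameters = {}

    def set_table_parameters(self, table_name, table_parameters):
        # Store a private copy so later changes by the caller have no effect.
        self._table_parameters[table_name] = copy.deepcopy(table_parameters)

    def get_table_parameters(self, table_name):
        # Hand back a copy, never the stored dictionary itself.
        return copy.deepcopy(self._table_parameters[table_name])


store = ParameterStore()
store.set_table_parameters(
    'guests',
    {'numerical_distributions': {'amenities_fee': 'beta'}},
)

# Editing the returned dictionary only changes the local copy.
params = store.get_table_parameters('guests')
params['numerical_distributions']['amenities_fee'] = 'uniform'

unchanged = store.get_table_parameters('guests')
```

After the edit, `unchanged` still reports 'beta' for amenities_fee. To change a stored value, pass the edited dictionary back through set_table_parameters.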

get_metadata

Use this function to access the metadata object that you have included for the synthesizer.
Parameters None
Output A MultiTableMetadata object
metadata = synthesizer.get_metadata()
The returned metadata is a copy. Changing it will not affect the synthesizer.

Learning from your data

To learn a machine learning model based on your real data, use the fit method.

fit

Parameters
  • (required) data: A dictionary mapping each table name to a pandas.DataFrame containing the real data that the machine learning model will learn from
Output (None)
Technical Details: HMA, which stands for Hierarchical Modeling Algorithm, uses a recursive technique to model the parent-child relationships of a multi-table dataset. At a base level, it uses Gaussian Copulas to model individual tables.
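One way to picture modeling a parent-child relationship is to summarize the child rows that belong to each parent row and attach that summary to the parent as extra columns. The sketch below is a greatly simplified illustration and not SDV's implementation: it swaps the Gaussian Copula for a row count, a mean and a standard deviation, and the function name, column names and data are all hypothetical.

```python
from statistics import mean, pstdev


def extend_parent_rows(parents, children, parent_key, foreign_key, column):
    """Attach a summary of each parent's child rows to the parent row.

    Simplified stand-in: a real hierarchical model would attach learned
    model parameters rather than these three summary statistics.
    """
    extended = []
    for parent in parents:
        values = [
            child[column]
            for child in children
            if child[foreign_key] == parent[parent_key]
        ]
        row = dict(parent)
        row['child_num_rows'] = len(values)
        row['child_mean'] = mean(values) if values else None
        row['child_std'] = pstdev(values) if values else None
        extended.append(row)
    return extended


hotels = [
    {'hotel_id': 'H1', 'rating': 4.5},
    {'hotel_id': 'H2', 'rating': 3.0},
]
guests = [
    {'guest_id': 1, 'hotel_id': 'H1', 'amenities_fee': 10.0},
    {'guest_id': 2, 'hotel_id': 'H1', 'amenities_fee': 30.0},
    {'guest_id': 3, 'hotel_id': 'H2', 'amenities_fee': 5.0},
]

extended_hotels = extend_parent_rows(
    hotels, guests, 'hotel_id', 'hotel_id', 'amenities_fee'
)
```

For a deeper schema, the same step could be applied from the lowest child table upward, so each parent's extended rows already carry information about its descendants. That bottom-up reading of the recursion is an assumption made for illustration.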

Accessing learned distributions

After fitting this synthesizer, you can access the marginal distributions using the get_learned_distributions method.
synthesizer.get_learned_distributions()
This returns a dictionary mapping each table name to a dictionary of parameters. The parameters include a column name, along with a distribution name (e.g. beta, uniform, gamma, etc.) and learned parameters.
{
    'guests': {
        'amenities_fee': {
            'distribution': 'beta',
            'learned_parameters': { 'a': 4, 'b': 5 }
        },
        ...
    },
    'hotels': {
        <column_name>: {
            'distribution': <name>,
            'learned_parameters': { ... }
        }
    }
}
For more information about the distributions and their parameters, visit the Copulas library.
Learned parameters are only available for parametric distributions. Parameters are not available for the gaussian_kde distribution.
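The nested dictionary can be flattened into one row per column for easier inspection. The sketch below is a minimal example that uses a hand-written dictionary shaped like the output above, not a real fitted synthesizer. The helper `summarize_distributions` is hypothetical, and reading `learned_parameters` with a default is an assumption about how a column without parameters would appear.

```python
def summarize_distributions(learned):
    """Flatten {table: {column: info}} into a list of tuples.

    Hypothetical helper; `learned` follows the structure shown above.
    """
    rows = []
    for table_name, columns in learned.items():
        for column_name, info in columns.items():
            # Columns without learned parameters fall back to an empty dict.
            parameters = info.get('learned_parameters', {})
            rows.append((table_name, column_name, info['distribution'], parameters))
    return rows


# Hand-written sample that mirrors the documented structure.
learned = {
    'guests': {
        'amenities_fee': {
            'distribution': 'beta',
            'learned_parameters': {'a': 4, 'b': 5},
        },
    },
}

rows = summarize_distributions(learned)
```

With a fitted synthesizer, the hand-written dictionary would be replaced by the result of get_learned_distributions.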

Saving your synthesizer

Save your trained synthesizer for future use.

save

Use this function to save your trained synthesizer as a Python pickle file.
Parameters
  • (required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl.
Output (None) The file will be saved at the desired location.
synthesizer.save(
    filepath='my_synthesizer.pkl'
)

HMASynthesizer.load

Use this function to load a trained synthesizer from a Python pickle file.
Parameters
  • (required) filepath: A string describing the filepath of your saved synthesizer
Output Your synthesizer, as an HMASynthesizer object
from sdv.multi_table import HMASynthesizer
synthesizer = HMASynthesizer.load(
    filepath='my_synthesizer.pkl'
)
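Because the saved file is a Python pickle, the save and load round trip can be illustrated with the standard library alone. This is a sketch under stated assumptions: a plain dictionary stands in for the trained synthesizer, and the helpers `save_object` and `load_object` are hypothetical, not SDV functions.

```python
import os
import pickle
import tempfile


def save_object(obj, filepath):
    """Pickle `obj` to `filepath`, which is expected to end in .pkl."""
    if not filepath.endswith('.pkl'):
        raise ValueError('filepath should end in .pkl')
    with open(filepath, 'wb') as handle:
        pickle.dump(obj, handle)


def load_object(filepath):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with open(filepath, 'rb') as handle:
        return pickle.load(handle)


# A plain dictionary stands in for the trained synthesizer.
state = {'guests': {'amenities_fee': {'distribution': 'beta'}}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'my_synthesizer.pkl')
    save_object(state, path)
    restored = load_object(path)
```

The restored object is equal to the original but is a separate object in memory. Unpickling can execute arbitrary code, so only load pickle files that you created or otherwise trust.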

What's next?

After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.
Want to improve your synthesizer? Input logical rules in the form of constraints, and customize the transformations used for pre- and post-processing the data.
For more details, see Advanced Features.

FAQs

How do I cite the HMA?
Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic data vault. DSAA, 2016.
@inproceedings{HMA,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}
What happens if columns don't contain numerical data?
This synthesizer models non-numerical columns, including columns with missing values.
Although the HMA algorithm is designed only for numerical data, this synthesizer converts other data types using Reversible Data Transforms (RDTs). To access and modify the transformations, see Advanced Features.
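The idea of a reversible transformation can be sketched with a toy encoder that maps categories to numbers and back. This is an illustration only: `ToyLabelEncoder` is hypothetical, is not an RDT class, and its handling of missing values (a reserved code of -1) is an assumption made for the example.

```python
class ToyLabelEncoder:
    """Hypothetical reversible transform for a categorical column."""

    MISSING_CODE = -1  # reserved code for missing values (assumption)

    def fit(self, values):
        # Assign each distinct non-missing category a stable integer code.
        categories = sorted({value for value in values if value is not None})
        self._to_number = {category: i for i, category in enumerate(categories)}
        self._to_category = {i: category for category, i in self._to_number.items()}
        return self

    def transform(self, values):
        return [
            self.MISSING_CODE if value is None else self._to_number[value]
            for value in values
        ]

    def reverse_transform(self, numbers):
        return [
            None if number == self.MISSING_CODE else self._to_category[number]
            for number in numbers
        ]


room_types = ['suite', 'basic', None, 'basic']
encoder = ToyLabelEncoder().fit(room_types)

encoded = encoder.transform(room_types)
decoded = encoder.reverse_transform(encoded)
```

The numeric form is what a numerical model would see, and reversing it recovers the original values, including the missing entry.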
Copyright (c) 2023, DataCebo, Inc.