HMASynthesizer
The HMA Synthesizer uses a hierarchical ML algorithm to learn from real data and generate synthetic data. The algorithm uses classical statistics.
Is the HMASynthesizer suited for your dataset? The HMASynthesizer is designed to capture correlations between different tables with high quality. The algorithm is optimized for datasets with around 5 tables and 1 level of depth (e.g. a parent and its child table). If you have a complex schema, use the simplify_schema function to create a smaller, simpler dataset for HMASynthesizer.
Want to model more complex graphs? You can inquire about our paid SDV plans. SDV Enterprise supports many more tables, so you will not have to use simplify_schema on the paid plan.
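To judge whether a schema falls inside the guidance above (around 5 tables, 1 level of depth), the check can be sketched in plain Python. This is only an illustration: `schema_depth` and `within_hma_guidance` are hypothetical helper names, relationships are assumed to be given as `(parent, child)` pairs, and the thresholds are the approximate figures quoted above rather than hard limits. The `simplify_schema` function itself is not called here.

```python
def schema_depth(relationships):
    """Return the longest parent -> child chain, counted in relationships."""
    children = {}
    for parent, child in relationships:
        children.setdefault(parent, []).append(child)

    def depth_from(table, seen):
        best = 0
        for child in children.get(table, []):
            if child in seen:  # guard against cyclic schemas
                continue
            best = max(best, 1 + depth_from(child, seen | {child}))
        return best

    tables = set(children) | {c for kids in children.values() for c in kids}
    return max((depth_from(table, {table}) for table in tables), default=0)


def within_hma_guidance(table_names, relationships, max_tables=5, max_depth=1):
    """Rough check against the 'around 5 tables, 1 level of depth' guidance."""
    return (
        len(table_names) <= max_tables
        and schema_depth(relationships) <= max_depth
    )


# A parent table with a single child table has a depth of 1.
small_schema_ok = within_hma_guidance(
    ['hotels', 'guests'], [('hotels', 'guests')]
)
```

If the check returns `False`, that is a hint that simplifying the schema first may be worthwhile.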
When creating your synthesizer, you are required to pass in a metadata object as the first argument.
All other parameters are optional. You can include them to customize the synthesizer.
locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.
(default) ['en_US']: Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)
<list>: Create data from the list of locales. Each locale string consists of a 2-character code for the language and a 2-character code for the country, separated by an underscore.
verbose: A boolean describing whether or not to show the progress when fitting the synthesizer.
(default) True: Show the progress when fitting the synthesizer. You'll see printed progress bars during every stage of the fitting process: preprocessing, learning relationships and modeling tables.
False: Do not show progress. The synthesizer will fit the data silently.
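The two optional parameters above can be collected and checked before the synthesizer is created. The sketch below is a minimal illustration: `is_valid_locale` and `build_synthesizer_kwargs` are hypothetical helper names, and the format check only encodes the rule stated above (2-character language code, underscore, 2-character country code). It assumes the resulting dictionary is then passed as keyword arguments when constructing the synthesizer.

```python
import re

# Locale format described above: two letters for the language, an
# underscore, then two letters for the country (for example 'en_US').
LOCALE_PATTERN = re.compile(r'[a-z]{2}_[A-Z]{2}')


def is_valid_locale(locale):
    """Return True if the string matches the <language>_<COUNTRY> format."""
    return isinstance(locale, str) and bool(LOCALE_PATTERN.fullmatch(locale))


def build_synthesizer_kwargs(locales=None, verbose=True):
    """Collect the optional parameters, using the documented defaults."""
    if locales is None:
        locales = ['en_US']  # documented default
    invalid = [loc for loc in locales if not is_valid_locale(loc)]
    if invalid:
        raise ValueError(f'Unexpected locale format: {invalid}')
    return {'locales': list(locales), 'verbose': bool(verbose)}


kwargs = build_synthesizer_kwargs(locales=['en_US', 'fr_CA'], verbose=False)

# Assumed usage, where `metadata` is your own metadata object:
# synthesizer = HMASynthesizer(metadata, **kwargs)
```

Checking the format up front means a typo such as `'en-US'` is reported before fitting starts.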
The HMA Synthesizer is a multi-table algorithm that models each individual table as well as the connections between them. You can get and set the parameters for each individual table.
Parameters
(required) table_name: A string describing the name of the table
Output (None)
Which distributions can I use with the HMA? Please note that the HMA algorithm is only compatible with parametric distributions that have a predefined number of parameters. You will not be able to use the 'gaussian_kde' distribution with HMA.
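The restriction on 'gaussian_kde' can be checked before the parameters are handed to the synthesizer. The following is a sketch under stated assumptions: the helper names are hypothetical, the method is assumed to be called `set_table_parameters` and to accept `table_name` and `table_parameters`, and distributions are assumed to appear under the keys `default_distribution` and `numerical_distributions`, as in the single-table GaussianCopula parameters.

```python
UNSUPPORTED_DISTRIBUTION = 'gaussian_kde'


def find_unsupported_entries(table_parameters):
    """List the entries in table_parameters that request 'gaussian_kde'.

    Assumes 'default_distribution' holds a string and
    'numerical_distributions' maps column names to distribution names.
    """
    problems = []
    if table_parameters.get('default_distribution') == UNSUPPORTED_DISTRIBUTION:
        problems.append('default_distribution')
    per_column = table_parameters.get('numerical_distributions') or {}
    for column, distribution in per_column.items():
        if distribution == UNSUPPORTED_DISTRIBUTION:
            problems.append(column)
    return problems


def apply_table_parameters(synthesizer, table_name, table_parameters):
    """Validate the parameters, then set them for one table."""
    problems = find_unsupported_entries(table_parameters)
    if problems:
        raise ValueError(
            f"'gaussian_kde' is not compatible with HMA: {problems}"
        )
    # Assumed method name and keyword arguments; returns None.
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters=table_parameters,
    )
```

A call might look like `apply_table_parameters(synthesizer, 'guests', {'default_distribution': 'truncnorm'})`, where the table name and distribution are placeholders.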
Use this function to access all the parameters your synthesizer uses -- those you have provided as well as the default ones.
Parameters (None)
Output A dictionary with the table names and parameters for each table.
Use this function to access all the parameters a table synthesizer uses -- those you have provided as well as the default ones.
Parameters
(required) table_name: A string describing the name of the table
Output A dictionary with the parameter names and the values
Use this function to access the metadata object that you have included for the synthesizer.
Parameters (None)
To learn a machine learning model based on your real data, use the fit method.
Parameters
(required) data: A dictionary mapping each table name to a pandas.DataFrame containing the real data that the machine learning model will learn from
Output (None)
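Because fit expects a dictionary keyed by table name, a quick check that the keys match the tables you intend to model can catch mistakes early. The sketch below is illustrative only: `fit_with_table_check` and `expected_tables` are hypothetical names, and `synthesizer` is assumed to be an already created HMASynthesizer.

```python
def fit_with_table_check(synthesizer, data, expected_tables):
    """Check that `data` covers exactly the expected tables, then fit.

    `data` maps each table name to a pandas.DataFrame, as described above.
    `expected_tables` is the list of table names you intend to model.
    """
    if not isinstance(data, dict):
        raise TypeError('data must be a dictionary of table name -> DataFrame')
    missing = sorted(set(expected_tables) - set(data))
    extra = sorted(set(data) - set(expected_tables))
    if missing or extra:
        raise ValueError(
            f'Table mismatch. Missing: {missing}. Unexpected: {extra}.'
        )
    synthesizer.fit(data)  # returns None; the synthesizer is updated in place
    return synthesizer
```

A call might look like `fit_with_table_check(synthesizer, {'hotels': hotels_df, 'guests': guests_df}, ['hotels', 'guests'])`, where the table names and DataFrames are placeholders for your own data.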
After fitting this synthesizer, you can access the marginal distributions that were learned to estimate the shape of each column.
Parameters
(required) table_name: A string with the name of the table
Output A dictionary that maps the name of each learned column to the distribution that estimates its shape
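The returned dictionary can be condensed into a quick per-column overview. This sketch assumes that each column maps to a dictionary with a `'distribution'` entry alongside the learned parameters; that layout, the helper name `summarize_learned_distributions`, and the sample values are assumptions for illustration, not real output.

```python
def summarize_learned_distributions(learned):
    """Map each column name to the name of its learned distribution.

    Assumes each value is a dict with a 'distribution' entry. Anything
    else is reported as 'unknown' instead of raising an error.
    """
    summary = {}
    for column, info in learned.items():
        if isinstance(info, dict):
            summary[column] = info.get('distribution', 'unknown')
        else:
            summary[column] = 'unknown'
    return summary


# Hypothetical input, shaped like the assumed output for one table.
example_learned = {
    'room_rate': {'distribution': 'beta', 'learned_parameters': {'a': 2.0}},
    'amenities_fee': {'distribution': 'truncnorm', 'learned_parameters': {}},
}
overview = summarize_learned_distributions(example_learned)
```

With a fitted synthesizer, the input would come from the function described above, called with the table name of interest.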
Save your trained synthesizer for future use.
Use this function to save your trained synthesizer as a Python pickle file.
Parameters
(required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl
Output (None) The file will be saved at the desired location
Use this function to load a trained synthesizer from a Python pickle file
Parameters
(required) filepath: A string describing the filepath of your saved synthesizer
Output Your synthesizer, as an HMASynthesizer object
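Saving and loading can be wrapped so that the .pkl requirement is always met. The sketch below assumes the synthesizer exposes a `save` method that accepts `filepath`, and that loading is a class-level `load` call; `ensure_pkl_suffix` and `save_synthesizer` are hypothetical helper names.

```python
def ensure_pkl_suffix(filepath):
    """Append '.pkl' when the path does not already end with it."""
    filepath = str(filepath)
    return filepath if filepath.endswith('.pkl') else filepath + '.pkl'


def save_synthesizer(synthesizer, filepath):
    """Save a trained synthesizer and return the path that was used."""
    target = ensure_pkl_suffix(filepath)
    synthesizer.save(filepath=target)  # assumed to write a pickle file
    return target


# Assumed loading call, made on the class rather than on an instance:
# synthesizer = HMASynthesizer.load(filepath='my_synthesizer.pkl')
```

Since the file is a Python pickle, only load synthesizers from sources you trust.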
For example [ , ]. For all options, see the .
table_parameters: A dictionary mapping the name of the parameter (string) to the value of the parameter (various). See for more details.
Output A metadata object
for more information on the GaussianCopula framework
The , published in the International Conference on Data Science and Advanced Analytics, October 2016
For more information about the distributions and their parameters, visit the.
After training your synthesizer, you can now sample synthetic data. See the section for more details.
For more details, see .
Although the HMA algorithm is designed for only numerical data, this synthesizer converts other data types using Reversible Data Transforms (RDTs). To access and modify the transformations, see .