DayZSynthesizer

SDV Enterprise Feature. This feature is only available for licensed, enterprise users. To learn more about the SDV Enterprise features and purchasing a license, visit our website.

The Day Z Synthesizer produces synthetic data from scratch using only the metadata. This allows you to start generating synthetic data from day zero: no real data or machine learning required!

from sdv.multi_table import DayZSynthesizer

synthesizer = DayZSynthesizer(metadata)
synthetic_data = synthesizer.sample(num_rows=1000)

Creating a synthesizer

When creating your synthesizer, you are required to pass in a Metadata object as the first argument. Other parameters are optional.

Parameter Reference

  • locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.

    • (default) ['en_US']: Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers).

    • <list>: Create data from the list of locales. Each locale string consists of a 2-character code for the language and a 2-character code for the country, separated by an underscore. For example, ["en_US", "fr_CA"]. For all options, see the Faker docs.

synthesizer = DayZSynthesizer(
    metadata,
    locales=['en_US', 'en_CA', 'fr_CA']
)
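Before passing a list of locales, it can help to confirm that each string has the shape described above. The following is a minimal sketch using only the standard library; the helper name find_malformed_locales and the exact pattern (lowercase language code, uppercase country code) are assumptions made for illustration and are not part of the SDV API. It checks only the shape of each string, not whether Faker supports the locale.

```python
import re

# Assumed shape: 2 lowercase letters, an underscore, 2 uppercase letters.
LOCALE_PATTERN = re.compile(r'[a-z]{2}_[A-Z]{2}')


def find_malformed_locales(locales):
    """Return the entries that do not look like 'xx_YY' (hypothetical helper)."""
    return [loc for loc in locales if not LOCALE_PATTERN.fullmatch(str(loc))]


locales = ['en_US', 'en_CA', 'fr_CA']
print(find_malformed_locales(locales))
# []
print(find_malformed_locales(['en_US', 'fr-CA', 'english']))
# ['fr-CA', 'english']
```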

Making the data more realistic

By default, this synthesizer will randomly generate multi table data that conforms to your metadata specification. This includes referential integrity: All connections between tables will make sense.

If you'd like to generate more realistic data, you can use the methods below to add guidance.

add_numerical_bounds

Use this method to set lower and upper bounds for numerical columns.

Parameters

  • (required) table_name: A string with the name of the table

  • (required) column_name: A string with the name of the column. This must be a numerical column referenced in your metadata.

  • (required) min_value: A float or int representing the minimum value.

  • (required) max_value: A float or int representing the maximum value.

Output (None) The sampled synthetic data will follow the min and max bounds.

synthesizer.add_numerical_bounds(
    table_name='guests',
    column_name='room_rate',
    min_value=30.00,
    max_value=5000.00
)
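After sampling, you may want to confirm that a column stays within the bounds you set. The sketch below is one way to do that with plain Python lists; the helper names and the sample values are hypothetical, and it assumes the bounds are inclusive, which this page does not state explicitly.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def values_outside_bounds(values, min_value, max_value):
    """Return the non-missing values that fall outside [min_value, max_value]."""
    return [
        v for v in values
        if not is_missing(v) and not (min_value <= v <= max_value)
    ]


# Hypothetical values standing in for a sampled 'room_rate' column.
room_rates = [30.0, 129.99, None, 5000.0, 12.5]
print(values_outside_bounds(room_rates, min_value=30.00, max_value=5000.00))
# [12.5]
```

With real output, the list could come from a sampled table, for example synthetic_data['guests']['room_rate'].tolist().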

add_datetime_bounds

Use this method to set lower and upper bounds for datetime columns.

Parameters

  • (required) table_name: A string with the name of the table

  • (required) column_name: A string with the name of the column. This must be a datetime column referenced in your metadata.

  • (required) start_timestamp: A string representing the earliest allowed datetime. The string must be in the same datetime format as referenced in your metadata.

  • (required) end_timestamp: A string representing the latest allowed datetime. The string must be in the same datetime format as referenced in your metadata.

Output (None) The sampled synthetic data will follow the start and end bounds.

synthesizer.add_datetime_bounds(
    table_name='guests',
    column_name='checkin_date',
    start_timestamp='01 Jan 2020',
    end_timestamp='31 Dec 2020'
)
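Because both timestamps must match the datetime format in your metadata, it can be useful to parse them with that format before calling the method. This is a minimal sketch using the standard library; the format string '%d %b %Y' is an assumed example that matches the strings above, and parse_bounds is a hypothetical helper. The ordering check is an extra sanity check in this sketch, not a documented rule.

```python
from datetime import datetime

# Assumed format for illustration; use the format recorded in your own metadata.
# Note that month names parsed by %b depend on the active locale.
DATETIME_FORMAT = '%d %b %Y'


def parse_bounds(start_timestamp, end_timestamp, datetime_format):
    """Parse both bounds; strptime raises ValueError if a string does not match."""
    start = datetime.strptime(start_timestamp, datetime_format)
    end = datetime.strptime(end_timestamp, datetime_format)
    if start > end:
        raise ValueError('start_timestamp is later than end_timestamp')
    return start, end


start, end = parse_bounds('01 Jan 2020', '31 Dec 2020', DATETIME_FORMAT)
print(start.date(), end.date())
# 2020-01-01 2020-12-31
```

A string such as '2020-01-01' would raise a ValueError here, because it does not follow the assumed format.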

set_category_values

Use this method to set the different values that are possible for categorical columns.

Parameters

  • (required) table_name: A string with the name of the table

  • (required) column_name: A string with the name of the column. This must be a categorical column referenced in your metadata.

  • (required) category_values: A list of strings representing the different unique category values that are possible. (If missing values are allowed, use the set_missing_values method instead of listing it here.)

Output (None) The sampled synthetic data will include the category values.

synthesizer.set_category_values(
    table_name='guests',
    column_name='room_type',
    category_values=['BASIC', 'DELUXE', 'SUITE']
)

set_missing_values

Use this method to set the proportion of missing values to generate in a column.

Parameters

  • (required) table_name: A string representing the name of the table

  • (required) column_name: A string representing the name of the column

  • (required) missing_values_proportion: A float representing the proportion of missing values

    • Any float between 0.0 and 1.0: Randomly create this proportion of missing values in the column

synthesizer.set_missing_values(
    table_name='guests',
    column_name='room_type',
    missing_values_proportion=0.1
)

Output (None) Sets the proportion of the missing values
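To see what a proportion of 0.1 looks like in practice, you can measure the share of missing entries in a column. The sketch below uses hypothetical values and a hypothetical helper, missing_proportion; it treats both None and NaN as missing. Since missing values are placed randomly, the proportion observed in a given sample might not match the requested one exactly (this is an assumption, not something this page confirms).

```python
import math


def missing_proportion(values):
    """Return the fraction of entries that are None or NaN."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


# Hypothetical values standing in for a sampled 'room_type' column.
room_types = ['BASIC', None, 'SUITE', 'DELUXE', 'BASIC',
              'SUITE', 'BASIC', 'DELUXE', 'SUITE', 'BASIC']
print(missing_proportion(room_types))
# 0.1
```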

set_cardinality

Use this function to set the cardinality of a parent/child relationship. The cardinality refers to the number of children that each parent row is allowed to have. This can be anywhere from 0 to infinity.

This function can help you create realistic data for many relationship types such as 1-to-1, 1-to-many, etc.

# each hotel must have 1 or more guests
synthesizer.set_cardinality(
    parent_table_name='hotels',
    child_table_name='guests',
    parent_primary_key='hotel_id',
    child_foreign_key='hotel_id',
    min_cardinality=1,
    max_cardinality=None
)

Parameters

  • (required) parent_table_name: The name of the parent table

  • (required) child_table_name: The name of the child table

  • (required) parent_primary_key: The name of the primary key in the parent

  • (required) child_foreign_key: The name of the foreign key in the child that refers to the primary key of the parent

  • min_cardinality: The minimum # of children each parent must have. Must be an integer >= 0.

    • (default) 0: A parent row must have 0 or more children

    • <integer>: An integer representing the minimum # of children

  • max_cardinality: The maximum # of children each parent may have. Must be an integer > 0.

    • (default) None: Do not enforce a maximum (i.e. the maximum # of children can be infinite)

    • <integer>: An integer >= min_cardinality representing the maximum # of children

    • Note that if min_cardinality = max_cardinality, there is a fixed # of children for each parent.

Output (None) Sets the min and max cardinality of the parent/child relationship, or updates it if the cardinality was already set.
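To make the meaning of cardinality concrete, the sketch below counts the children of each parent row and reports any parent whose count falls outside the chosen range. It uses plain lists of key values with hypothetical data; cardinality_violations is a helper written for this illustration and is not part of the SDV API.

```python
from collections import Counter


def cardinality_violations(parent_keys, child_foreign_keys,
                           min_cardinality=0, max_cardinality=None):
    """Map each parent key to its child count when the count is out of range."""
    counts = Counter(child_foreign_keys)
    violations = {}
    for key in parent_keys:
        n_children = counts.get(key, 0)
        too_few = n_children < min_cardinality
        # max_cardinality=None means no upper limit is enforced.
        too_many = max_cardinality is not None and n_children > max_cardinality
        if too_few or too_many:
            violations[key] = n_children
    return violations


# Hypothetical keys: hotel 'H3' has no guests, so it breaks min_cardinality=1.
hotel_ids = ['H1', 'H2', 'H3']
guest_hotel_ids = ['H1', 'H1', 'H2']
print(cardinality_violations(hotel_ids, guest_hotel_ids, min_cardinality=1))
# {'H3': 0}
```

Setting min_cardinality=1 and max_cardinality=1 in this check corresponds to a 1-to-1 relationship: every parent must have exactly one child.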

get_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

Parameters

  • output_filepath: A string representing the name of the file to write the parameters to. We recommend storing this as a JSON file. Defaults to None, meaning that no file is written.

Output A dictionary with the table names and parameters for each table.

These parameters are only for the multi-table synthesizer. To get individual table-level parameters, use the get_table_parameters function.

The returned parameters are a copy. Changing them will not affect the synthesizer.

synthesizer.get_parameters()
{
    'locales': ['en_US', 'fr_CA'],
    ...
}

get_table_parameters

Use this function to access all the parameters a table synthesizer uses: those you have provided as well as the default ones.

Parameters

  • (required) table_name: A string describing the name of the table

Output A dictionary with the parameter names and the values

synthesizer.get_table_parameters(table_name='users')
{
    'synthesizer_name': 'DayZSynthesizer',
    'synthesizer_parameters': {
        'columns': {
            ...
        }
    }
}

Saving your synthesizer

Save your synthesizer for future use.

save

Use this function to save your synthesizer as a Python pickle file.

Parameters

  • (required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl

Output (None) The file will be saved at the desired location

synthesizer.save(
    filepath='my_synthesizer.pkl'
)

DayZSynthesizer.load

Use this function to load a synthesizer from a Python pickle file.

Parameters

  • (required) filepath: A string describing the filepath of your saved synthesizer

Output Your synthesizer, as a DayZSynthesizer object

from sdv.multi_table import DayZSynthesizer

synthesizer = DayZSynthesizer.load(
    filepath='my_synthesizer.pkl'
)

Sample synthetic data

Sample any amount of synthetic data.

sample

Use this method to sample synthetic data.

Parameters

  • num_rows: An integer > 0 that specifies the default number of rows to synthesize for each table. Defaults to 1000, meaning 1000 rows will be generated for each table.

  • num_rows_per_table: A dictionary mapping each table name to the number of rows to generate for that table. If a table is not in this dictionary, we'll fall back to the num_rows value.

Output A dictionary that maps each table name (string) to a pandas DataFrame object with synthetic data for that table. The synthetic data conforms to your metadata and any guidance you have added.

synthetic_data = synthesizer.sample(
    num_rows_per_table={
        'hotels': 1000,
        'guests': 2500,
    }
)
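The fallback between num_rows_per_table and num_rows can be pictured as a simple dictionary lookup. The sketch below shows that reading of the rule; resolve_row_counts and the 'reviews' table are hypothetical, and the synthesizer's internal logic may differ (for example, to satisfy cardinality settings).

```python
def resolve_row_counts(table_names, num_rows=1000, num_rows_per_table=None):
    """Pick the per-table row count, falling back to num_rows when unspecified."""
    num_rows_per_table = num_rows_per_table or {}
    return {
        table: num_rows_per_table.get(table, num_rows)
        for table in table_names
    }


# 'reviews' is a hypothetical third table with no explicit row count.
print(resolve_row_counts(
    ['hotels', 'guests', 'reviews'],
    num_rows=500,
    num_rows_per_table={'guests': 2500},
))
# {'hotels': 500, 'guests': 2500, 'reviews': 500}
```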

What's next?

This synthesizer has limited functionality. It is not compatible with conditional sampling or constraints.

If you wish to use these features, we recommend using real data and machine learning to train an HSASynthesizer.


Copyright (c) 2023, DataCebo, Inc.