Custom Synthesizers

The SDGym allows you to benchmark your custom synthesizer. Follow this guide to write your synthesizer in the correct format.

Your synthesizer should work on all single-table datasets. The SDGym does not currently support sequential or multi-table data.

Creating your synthesizer

Your synthesizer should work by training a machine learning model using the real data. Then, it should sample synthetic data using the model. You will need to provide both the training and sampling logic using the guidelines below.

Step 1: Training

Write a function that trains a model using the real data and any information present in the Metadata. The function returns a fully trained synthesizer, which can be any kind of object.

Parameters

  • (required) data: A pandas.DataFrame with the real data

  • (required) metadata: A Metadata dictionary that provides information about the column types in the real data

Output: Any object that represents your fully trained synthesizer

def get_trained_synthesizer(data, metadata):
    # create an object to represent your synthesizer
    # train it using the data and metadata
    return synthesizer
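
To make the template above concrete, here is a minimal sketch of a training function. It assumes a deliberately simple "model" that only remembers the values observed in each column. The dictionary it returns and the toy real_data table are illustrative choices, not part of SDGym, and the metadata argument is accepted but unused.

```python
import pandas as pd


def get_trained_synthesizer(data, metadata):
    # Hypothetical "model": store the observed values of every column.
    # metadata is accepted to match the expected signature, but this
    # sketch does not use it.
    return {column: data[column].to_numpy() for column in data.columns}


# Toy real data, for illustration only
real_data = pd.DataFrame({
    'age': [23, 35, 41],
    'city': ['Paris', 'Lima', 'Oslo'],
})
synthesizer = get_trained_synthesizer(real_data, metadata={})
```

Because the output can be any kind of object, a plain dictionary is enough here. A real synthesizer would typically return a fitted model instance instead.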

Step 2: Sampling

Write a function that accepts the trained synthesizer (from the previous step) and uses it to generate synthetic data with a specified number of rows.

Parameters

  • (required) synthesizer: The synthesizer object from the previous step

  • (required) n_rows: A positive integer giving the number of synthetic data rows to create

Output: A pandas.DataFrame object with the synthetic data. It should contain the specified number of rows.

def sample_from_synthesizer(synthesizer, n_rows):
    # use the trained synthesizer object to sample
    # n_rows of synthetic data
    return synthetic_data
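
As an illustration of the sampling template above, the sketch below assumes the trained synthesizer is a dictionary mapping each column name to an array of observed values. That representation is a hypothetical choice for this example, not an SDGym requirement. Each column is resampled independently with replacement, so relationships between columns are not preserved.

```python
import numpy as np
import pandas as pd


def sample_from_synthesizer(synthesizer, n_rows):
    # Assumption: synthesizer is a dict of {column name: array of values}.
    rng = np.random.default_rng()
    synthetic_columns = {
        column: rng.choice(values, size=n_rows, replace=True)
        for column, values in synthesizer.items()
    }
    return pd.DataFrame(synthetic_columns)


# Hand-built stand-in for a trained synthesizer, for illustration only
trained = {
    'age': np.array([23, 35, 41]),
    'city': np.array(['Paris', 'Lima', 'Oslo']),
}
synthetic_data = sample_from_synthesizer(trained, n_rows=5)
```

The returned pandas.DataFrame always has exactly n_rows rows, which is the one property the sampling function is required to guarantee.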

Step 3: Creating your synthesizer

Once you've defined your logic, put it all together to create an SDGym synthesizer. Use the create_single_table_synthesizer function.

Parameters

  • (required) get_trained_synthesizer_fn: The function from Step 1 that returns a trained synthesizer

  • (required) sample_from_synthesizer_fn: The function from Step 2 that creates synthetic data from a trained synthesizer

  • (required) display_name: A string representing the name of the synthesizer. This display name will be used to identify your custom synthesizer in the benchmarking results.

Output: A class object that represents your custom SDGym synthesizer.

from sdgym import create_single_table_synthesizer

MyCustomSynthesizerClass = create_single_table_synthesizer(
    get_trained_synthesizer_fn=get_trained_synthesizer,
    sample_from_synthesizer_fn=sample_from_synthesizer,
    display_name='MyCustomSynthesizer'
)
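
Before running a full benchmark, it can help to call the two functions directly and confirm that they follow the contract described in Steps 1 and 2. The sketch below is one possible way to do that. The helper check_synthesizer_functions, the toy functions, and the toy data are all hypothetical and not part of SDGym.

```python
import pandas as pd


def check_synthesizer_functions(train_fn, sample_fn, data, metadata, n_rows=10):
    """Hypothetical helper: train, sample, and verify the documented outputs."""
    synthesizer = train_fn(data, metadata)
    synthetic_data = sample_fn(synthesizer, n_rows)
    if not isinstance(synthetic_data, pd.DataFrame):
        raise TypeError('The sampling function must return a pandas.DataFrame')
    if len(synthetic_data) != n_rows:
        raise ValueError(f'Expected {n_rows} rows, got {len(synthetic_data)}')
    return synthetic_data


# Toy stand-ins for the training and sampling functions
def toy_train(data, metadata):
    # "Training" here just keeps a copy of the real data
    return data.copy()


def toy_sample(synthesizer, n_rows):
    # Resample whole rows with replacement
    return synthesizer.sample(n=n_rows, replace=True).reset_index(drop=True)


real_data = pd.DataFrame({'age': [23, 35, 41], 'city': ['Paris', 'Lima', 'Oslo']})
checked = check_synthesizer_functions(toy_train, toy_sample, real_data, metadata={}, n_rows=4)
```

The helper only checks what this guide specifies: the sampling output is a pandas.DataFrame with the requested number of rows.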

Using your synthesizer

Once you've created your custom SDGym synthesizer, use it in a benchmarking run by providing the custom_synthesizers parameter. Pass in the classes directly.

import sdgym

sdgym.benchmark_single_table(
    custom_synthesizers=[MyCustomSynthesizerClass]
)

Results

Results from your custom synthesizer will be labeled by the provided display_name.

Synthesizer                 Dataset  Dataset_Size_MB  Model_Time  Peak_Memory_KB  Model_Size_MB  Sample_Time  Evaluate_Time  Quality_Score  NewRowSynthesis
Custom:MyCustomSynthesizer  alarm    34.5             45.45       100201          0.340          2012.2       1001.2         0.71882        0.99901
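
To pick out your custom synthesizer's rows from a larger results table, you can filter on the Synthesizer column. The sketch below assumes the results are available as a pandas.DataFrame. It builds a one-row stand-in from a few of the columns in the example above rather than running a real benchmark.

```python
import pandas as pd

# Stand-in results table using values from the example row above
results = pd.DataFrame([{
    'Synthesizer': 'Custom:MyCustomSynthesizer',
    'Dataset': 'alarm',
    'Model_Time': 45.45,
    'Sample_Time': 2012.2,
    'NewRowSynthesis': 0.99901,
}])

# Keep only the rows whose label starts with the "Custom:" prefix
custom_rows = results[results['Synthesizer'].str.startswith('Custom:')]
```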

See Interpreting Results for more details.

© Copyright 2023, DataCebo, Inc.