Sample Realistic Data

Create realistic synthetic data that follows the same format and mathematical properties as the real data.

sample

Use this function to create synthetic data that follows the same format and mathematical properties as the real data.

synthetic_data = synthesizer.sample(
    num_rows=1_000_000,
    batch_size=1_000
)

Parameters

  • (required) num_rows: An integer >0 that specifies the number of rows to synthesize

  • batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.

  • max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple attempts to create valid data. Defaults to 100.

  • output_file_path: A string describing a CSV filepath for writing the synthetic data. Set to None to skip writing to a file. Defaults to None.

Returns A pandas DataFrame object with synthetic data. The synthetic data mimics the real data.
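The parameters above can be combined in a single call. The sketch below is a minimal illustration, assuming you already have a fitted synthesizer object. The helper name sample_to_csv and its default values are hypothetical and not part of the library. It only forwards the documented parameters to synthesizer.sample.

```python
def sample_to_csv(synthesizer, num_rows, csv_path,
                  batch_size=1_000, max_tries_per_batch=100):
    """Sample rows in batches and write them to a CSV file.

    Hypothetical helper for illustration only. `synthesizer` is assumed to be
    an already-fitted synthesizer that exposes the `sample` method described
    above.
    """
    # All three numeric parameters are documented as integers > 0.
    for name, value in [("num_rows", num_rows),
                        ("batch_size", batch_size),
                        ("max_tries_per_batch", max_tries_per_batch)]:
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be an integer > 0, got {value!r}")

    return synthesizer.sample(
        num_rows=num_rows,
        # Capping the batch size at num_rows is a choice made in this sketch,
        # not a documented requirement.
        batch_size=min(batch_size, num_rows),
        max_tries_per_batch=max_tries_per_batch,
        output_file_path=csv_path,
    )
```

Calling sample_to_csv(synthesizer, 1_000_000, 'synthetic_data.csv') would request one million rows in batches of 1,000 and pass the CSV path through as output_file_path. The filename here is only an example.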

reset_sampling

Use this function to reset any randomization in sampling. After calling this, your synthesizer restarts its random sampling from the beginning, so it generates the same data as before. For example, in the code below, synthetic_data1 and synthetic_data2 are the same.

synthesizer.reset_sampling()
synthetic_data1 = synthesizer.sample(num_rows=10)

synthesizer.reset_sampling()
synthetic_data2 = synthesizer.sample(num_rows=10)

Parameters None

Returns None. Resets the synthesizer.
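To build intuition for why resetting reproduces the same data, the toy model below uses a seeded random number generator from the Python standard library. This is a sketch only: ToySampler is a hypothetical stand-in and does not describe how the synthesizer is implemented internally.

```python
import random


class ToySampler:
    """Toy stand-in for a synthesizer (hypothetical, not the library's code)."""

    def __init__(self, seed=0):
        self._seed = seed
        self._rng = random.Random(seed)

    def sample(self, num_rows):
        # Each call consumes more of the same random stream.
        return [self._rng.random() for _ in range(num_rows)]

    def reset_sampling(self):
        # Rewind the random stream to its starting point.
        self._rng = random.Random(self._seed)


sampler = ToySampler()
first = sampler.sample(num_rows=10)
second = sampler.sample(num_rows=10)   # continues the stream

sampler.reset_sampling()
third = sampler.sample(num_rows=10)    # replays the stream from the start

print(first == third)
print(first == second)
```

In this toy model, the sample taken after the reset matches the first one, while the sample taken without a reset continues the random stream and differs from it.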

If you save your synthesizer, it will reset sampling automatically for you. The next time you load and sample, you will receive the same synthetic data.

Copyright (c) 2023, DataCebo, Inc.