Sampling

Use these sampling methods to create synthetic data from any of the single-table models. You can combine multiple functions to create synthetic data that is customized for your use case.

Create Realistic Data

Create realistic synthetic data that follows the same format and mathematical properties as the real data.

sample

Use this function to create synthetic data that mimics the real data.
synthetic_data = synthesizer.sample(
    num_rows=1_000_000,
    batch_size=1_000
)
Parameters
  • (required) num_rows: An integer >0 that specifies the number of rows to synthesize
  • batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.
  • max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple attempts to create valid data. Defaults to 100.
  • output_file_path: A string describing a CSV filepath for writing the synthetic data. Set to None to skip writing to a file. Defaults to None.
Returns: A pandas DataFrame object with synthetic data. The synthetic data mimics the real data.
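To make the interplay of batch_size and max_tries_per_batch more concrete, here is a toy sketch in plain Python. It is an analogy only and not SDV's implementation: draw_row, is_valid, and sample_in_batches are hypothetical names, and the validity check stands in for a constraint.

```python
import random


def draw_row(rng):
    """Hypothetical stand-in for a model drawing one synthetic row."""
    return {"nights": rng.randint(0, 10)}


def is_valid(row):
    """Hypothetical constraint: a stay must last at least one night."""
    return row["nights"] >= 1


def sample_in_batches(num_rows, batch_size=None, max_tries_per_batch=100, seed=0):
    rng = random.Random(seed)
    batch_size = batch_size or num_rows  # default: a single batch of num_rows
    rows = []
    while len(rows) < num_rows:
        target = min(batch_size, num_rows - len(rows))
        batch = []
        for _ in range(max_tries_per_batch):
            # Draw only as many candidates as this batch is still missing.
            candidates = [draw_row(rng) for _ in range(target - len(batch))]
            batch.extend(r for r in candidates if is_valid(r))
            if len(batch) == target:
                break
        else:
            # Every attempt was used up and the batch is still incomplete.
            raise RuntimeError("Could not fill a batch within max_tries_per_batch")
        rows.extend(batch)  # a completed batch is a natural point to save progress
    return rows


rows = sample_in_batches(num_rows=1_000, batch_size=100)
```

In this sketch a smaller batch size means progress is recorded more often, while a larger max_tries_per_batch gives each batch more chances to satisfy the constraint.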

Simulate Scenarios

Specify exact conditions to simulate a hypothetical scenario.

Define Your Conditions

Create a Condition object to specify an exact condition that you want to include in the synthetic data.
from sdv.sampling import Condition

suite_guests_with_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': True}
)

suite_guests_without_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': False}
)
Parameters
  • (required) num_rows: The number of rows that need to be included in the scenario
  • (required) column_values: A dictionary that describes the scenario. The keys should be column names and the values should be the exact data that each column should have.
You may require multiple conditions. Define as many Condition objects as you need to construct your hypothetical scenario. For example, you may want to create a 50/50 mix of active and inactive users of various tiers.
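One way to assemble such a mix is to generate the Condition arguments in a loop. The sketch below builds plain dictionaries so that it runs without SDV installed. The column names is_active and tier, the tier values, and the helper build_condition_specs are assumptions made for illustration.

```python
from itertools import product


def build_condition_specs(total_rows, tiers):
    """Split total_rows evenly across every (is_active, tier) combination."""
    combos = list(product([True, False], tiers))
    rows_per_combo, remainder = divmod(total_rows, len(combos))
    specs = []
    for i, (is_active, tier) in enumerate(combos):
        specs.append({
            # Any remainder goes to the first few combinations, so the split
            # is only approximately even when total_rows is not divisible.
            "num_rows": rows_per_combo + (1 if i < remainder else 0),
            "column_values": {"is_active": is_active, "tier": tier},
        })
    return specs


specs = build_condition_specs(total_rows=600, tiers=["BASIC", "SILVER", "GOLD"])
# With SDV installed, each spec could be unpacked into a Condition:
# conditions = [Condition(**spec) for spec in specs]
```

Each dictionary carries the same two arguments that Condition expects, num_rows and column_values.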

sample_from_conditions

Use this function to simulate a hypothetical scenario using synthetic data. Your scenario is encoded as Condition objects.
synthetic_data = synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards],
    output_file_path='synthetic_simulated_scenario.csv'
)
Parameters
  • (required) conditions: A list of Condition objects that specify the exact values that you want to fix and the number of rows to synthesize.
  • batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.
  • max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple attempts to create valid data. Defaults to 100.
  • output_file_path: A string describing a CSV filepath for writing the synthetic data. Set to None to skip writing to a file. Defaults to None.
Returns: A pandas DataFrame object with synthetic data. The synthetic data is simulated based on the conditions.
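After sampling, it can be worth confirming that each condition contributed the expected number of rows. This is a minimal sketch, assuming the returned DataFrame has been converted to a list of row dictionaries (for example with pandas' to_dict('records')). The helper count_matching, the amenities_fee column, and the sample rows are hypothetical.

```python
def count_matching(records, column_values):
    """Count rows whose columns equal every value in column_values."""
    return sum(
        all(row.get(col) == val for col, val in column_values.items())
        for row in records
    )


# Hypothetical rows standing in for synthetic_data.to_dict('records').
records = [
    {"room_type": "SUITE", "has_rewards": True, "amenities_fee": 12.5},
    {"room_type": "SUITE", "has_rewards": False, "amenities_fee": 0.0},
    {"room_type": "SUITE", "has_rewards": True, "amenities_fee": 30.0},
]

with_rewards = count_matching(records, {"room_type": "SUITE", "has_rewards": True})
without_rewards = count_matching(records, {"room_type": "SUITE", "has_rewards": False})
```

Comparing these counts with the num_rows of each Condition shows whether the scenario was simulated as requested.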

Reference Known Data

Synthesize data based on known, reference columns.

sample_remaining_columns

Use this function to sample remaining columns based on known, reference columns.
import pandas as pd

reference_data = pd.DataFrame(data={
    'room_type': ['SUITE', 'SUITE', 'DELUXE', 'BASIC', 'BASIC'],
    'has_rewards': [True, True, True, False, False]
})

synthetic_data = synthesizer.sample_remaining_columns(
    known_columns=reference_data,
    max_tries_per_batch=500
)
Parameters
  • (required) known_columns: A pandas DataFrame object with columns that you already know the values for.
  • batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.
  • max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple attempts to create valid data. Defaults to 100.
  • output_file_path: A string describing a CSV filepath for writing the synthetic data. Set to None to skip writing to a file. Defaults to None.
Returns: A pandas DataFrame object with synthetic data. The synthetic data is based on the known, reference columns.

Controlling Randomization

Every time you use any of the sampling methods, the synthetic data will be different from the data produced by previous runs.

reset_sampling

Use this function to reset the randomization. After you call it, the sampling methods generate the same data that they generated before. For example, in the code below, synthetic_data1 and synthetic_data2 are the same.
synthesizer.reset_sampling()
synthetic_data1 = synthesizer.sample(num_rows=10)
synthesizer.reset_sampling()
synthetic_data2 = synthesizer.sample(num_rows=10)
Parameters: None
Returns: None. Resets the synthesizer.
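The behavior resembles re-seeding a random number generator. The toy class below is an analogy in plain Python and not SDV's implementation. ToySynthesizer and its fixed seed are assumptions made for illustration.

```python
import random


class ToySynthesizer:
    """Hypothetical stand-in that mimics the reset behavior with a seeded RNG."""

    def __init__(self, seed=42):
        self._seed = seed
        self._rng = random.Random(seed)

    def sample(self, num_rows):
        # Each call consumes more of the same random stream.
        return [self._rng.randint(0, 100) for _ in range(num_rows)]

    def reset_sampling(self):
        # Restart the generator from its initial state.
        self._rng = random.Random(self._seed)


toy = ToySynthesizer()
first = toy.sample(num_rows=10)
second = toy.sample(num_rows=10)       # continues the random stream
toy.reset_sampling()
after_reset = toy.sample(num_rows=10)  # replays the stream from the start
```

In this sketch, after_reset equals first because the generator was restarted, while second comes from a later part of the same stream.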
Copyright (c) 2023, DataCebo, Inc.