Sampling
Use these sampling methods to create synthetic data from any of the single table models. You can use multiple functions to create synthetic data that is customized for your use case.
Create realistic synthetic data that follows the same format and mathematical properties as the real data.
Use this function to create synthetic data that mimics the real data.
synthetic_data = synthesizer.sample(
    num_rows=1_000_000,
    batch_size=1_000
)
Parameters
- (required) num_rows: An integer >0 that specifies the number of rows to synthesize.
- batch_size: An integer >0 that specifies the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.
- max_tries_per_batch: An integer >0 that specifies the number of sampling attempts to make per batch. If you have included constraints, it may take multiple attempts to create valid data. Defaults to 100.
- output_file_path: A string describing a CSV filepath for writing the synthetic data. Set to None to skip writing to a file. Defaults to None.
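To make the interaction between batch_size and max_tries_per_batch more concrete, the sketch below shows one way batched sampling with retries can work. It is an illustration only and not SDV's implementation: sample_in_batches, draw_row, is_valid, and the 'nights' column are hypothetical names, and the constraint is a toy rule.

```python
import random


def sample_in_batches(draw_row, is_valid, num_rows, batch_size=None,
                      max_tries_per_batch=100):
    """Toy batched sampler: collect valid rows one batch at a time.

    Hypothetical helper for illustration; not part of SDV.
    """
    batch_size = batch_size or num_rows  # mirror the documented default
    rows = []
    while len(rows) < num_rows:
        target = min(batch_size, num_rows - len(rows))
        batch = []
        for _ in range(max_tries_per_batch):
            # Each try draws only the rows still missing from this batch
            # and keeps the ones that satisfy the constraint.
            candidates = [draw_row() for _ in range(target - len(batch))]
            batch.extend(row for row in candidates if is_valid(row))
            if len(batch) == target:
                break
        if not batch:
            # Give up instead of looping forever on an unsatisfiable rule.
            break
        rows.extend(batch)
    return rows


rng = random.Random(0)


def draw():
    # Hypothetical row generator: a stay of 1 to 14 nights.
    return {'nights': rng.randint(1, 14)}


rows = sample_in_batches(draw, lambda r: r['nights'] >= 3,
                         num_rows=1_000, batch_size=100)
```

In this sketch a smaller batch size means valid rows accumulate in smaller steps, and a stricter rule needs more tries per batch before the batch fills.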
Specify exact conditions to simulate a hypothetical scenario.
Create a Condition object to specify an exact condition that you want to include in the synthetic data.
from sdv.sampling import Condition

suite_guests_with_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': True}
)

suite_guests_without_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': False}
)
Parameters
- (required) num_rows: The number of rows to include in the scenario.
- (required) column_values: A dictionary describing the scenario. The keys should be column names and the values should be the exact data that each column should have.
You may require multiple conditions. Define as many Condition objects as you need to construct your hypothetical scenario. For example, you may want to create a 50/50 mix of active and inactive users of various tiers.
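As a rough sketch of the 50/50 example, you could generate one condition per combination of activity status and tier, each with the same row count. The column names 'is_active' and 'tier', the tier values, and the row count are assumptions made for illustration; substitute the columns from your own table. The import is guarded so the snippet still runs where SDV is not installed.

```python
from itertools import product

# Hypothetical column names and values; replace with ones from your table.
activity_values = [True, False]   # equal counts give a 50/50 split
tiers = ['BASIC', 'SILVER', 'GOLD']
rows_per_combination = 100

# One specification per (activity, tier) combination, with equal row counts.
condition_specs = [
    {
        'num_rows': rows_per_combination,
        'column_values': {'is_active': is_active, 'tier': tier},
    }
    for is_active, tier in product(activity_values, tiers)
]

try:
    from sdv.sampling import Condition
    conditions = [Condition(**spec) for spec in condition_specs]
except ImportError:
    # SDV is not installed; the specifications above still show the layout.
    conditions = []
```

Because every combination gets the same num_rows, active and inactive users each account for half of the requested rows.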
Use this function to simulate a hypothetical scenario using synthetic data. Your scenario is encoded as Condition objects.
synthetic_data = custom_synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards],
    output_file_path='synthetic_simulated_scenario.csv'
)
Parameters
- (required) conditions: A list of Condition objects that specify the exact values that you want to fix and the number of rows to synthesize.
- batch_size: An integer >0 that specifies the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.
- max_tries_per_batch: An integer >0 that specifies the number of sampling attempts to make per batch. If you have included constraints, it may take multiple attempts to create valid data. Defaults to 100.
- output_file_path: A string describing a CSV filepath for writing the synthetic data. Set to None to skip writing to a file. Defaults to None.
Returns
A pandas DataFrame object with synthetic data. The synthetic data is simulated based on the conditions.
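One way to sanity-check a simulated scenario is to count how many returned rows match each condition. The sketch below works on plain row dictionaries so that it runs without any dependencies. count_matching is a hypothetical helper, the 'amount' column is invented, and the rows are hand-written stand-ins rather than real synthesizer output.

```python
def count_matching(records, column_values):
    """Count rows whose columns equal every value in column_values."""
    return sum(
        all(row.get(col) == val for col, val in column_values.items())
        for row in records
    )


# Hand-written stand-in rows; real rows would come from the synthesizer.
records = [
    {'room_type': 'SUITE', 'has_rewards': True, 'amount': 120.5},
    {'room_type': 'SUITE', 'has_rewards': False, 'amount': 98.0},
    {'room_type': 'SUITE', 'has_rewards': True, 'amount': 143.2},
]

with_rewards = count_matching(
    records, {'room_type': 'SUITE', 'has_rewards': True})
without_rewards = count_matching(
    records, {'room_type': 'SUITE', 'has_rewards': False})
```

With a pandas DataFrame, synthetic_data.to_dict('records') produces rows in this form, so each count can be compared against the num_rows requested in the matching Condition.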
Synthesize data based on known, reference columns.
Use this function to sample remaining columns based on known, reference columns.
import pandas as pd

reference_data = pd.DataFrame(data={
    'room_type': ['SUITE', 'SUITE', 'DELUXE', 'BASIC', 'BASIC'],
    'has_rewards': [True, True, True, False, False]
})

synthetic_data = synthesizer.sample_remaining_columns(
    known_columns=reference_data,
    max_tries_per_batch=500
)
Parameters
- (required) known_columns: A pandas DataFrame object with columns that you already know the values for.
- batch_size: An integer >0 that specifies the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.
- max_tries_per_batch: An integer >0 that specifies the number of sampling attempts to make per batch. If you have included constraints, it may take multiple attempts to create valid data. Defaults to 100.
- output_file_path: A string describing a CSV filepath for writing the synthetic data. Set to None to skip writing to a file. Defaults to None.
Returns
A pandas DataFrame object with synthetic data. The synthetic data is based on the known, reference columns.
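If you want to confirm that the known columns carried through to the output, a check along the following lines may help. It assumes the output is expected to contain one row per reference row with the known values unchanged. That expectation is an assumption drawn from the description above rather than a documented guarantee. known_values_preserved is a hypothetical helper, 'amount' is an invented column, and the synthetic rows are hand-written stand-ins.

```python
from collections import Counter


def known_values_preserved(reference_rows, synthetic_rows, known_columns):
    """Return True if both row sets hold the same known-column combinations.

    Row order is ignored; only the counts of each combination are compared.
    """
    def key(row):
        return tuple(row[col] for col in known_columns)

    return (Counter(map(key, reference_rows))
            == Counter(map(key, synthetic_rows)))


reference_rows = [
    {'room_type': 'SUITE', 'has_rewards': True},
    {'room_type': 'BASIC', 'has_rewards': False},
]
# Hand-written stand-in for synthesizer output, with one extra sampled column.
synthetic_rows = [
    {'room_type': 'BASIC', 'has_rewards': False, 'amount': 75.0},
    {'room_type': 'SUITE', 'has_rewards': True, 'amount': 210.0},
]

ok = known_values_preserved(
    reference_rows, synthetic_rows, ['room_type', 'has_rewards'])
```

A False result would suggest that some reference rows are missing from the output or that a known value changed.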
Every time you use any of the sampling methods, the synthetic data will be different from the previous runs.
Use this function to reset the randomization. After calling this, any sampling method generates the same data as before. For example, in the code below, synthetic_data1 and synthetic_data2 are the same.
synthesizer.reset_sampling()
synthetic_data1 = synthesizer.sample(num_rows=10)
synthesizer.reset_sampling()
synthetic_data2 = synthesizer.sample(num_rows=10)
Parameters
None
Returns
None. Resets the synthesizer's randomization.
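The behavior described here resembles restoring a seeded random number generator to its initial state. The toy class below illustrates that idea with Python's standard library. It is an analogy under that assumption, not SDV's implementation, and ToySampler is a hypothetical name.

```python
import random


class ToySampler:
    """Hypothetical stand-in for a synthesizer; not SDV code."""

    def __init__(self, seed=0):
        self._seed = seed
        self._rng = random.Random(seed)

    def sample(self, num_rows):
        # Each call advances the generator, so repeated calls differ.
        return [self._rng.random() for _ in range(num_rows)]

    def reset_sampling(self):
        # Restore the generator to its initial state.
        self._rng = random.Random(self._seed)


sampler = ToySampler()
first = sampler.sample(num_rows=10)
second = sampler.sample(num_rows=10)    # continues the stream
sampler.reset_sampling()
repeated = sampler.sample(num_rows=10)  # replays the first draw
```

In this toy version, first and repeated hold the same values while second differs, which mirrors the synthetic_data1 and synthetic_data2 example above.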