Sampling
Use these sampling methods to create synthetic data from your multi-table model.
Sample Realistic Data
Create realistic synthetic data that follows the same format and mathematical properties as the real data.
sample
Use this function to create synthetic data that mimics the real data.
Parameters
scale
: A float > 0.0 that describes how much to scale the data by

| `scale=1` (default) | Don't scale the data. The model will create synthetic data that is roughly the same size as the original data. |
| `scale` > 1 | Scale the data up by the specified factor. For example, `scale=2.5` will create synthetic data that is roughly 2.5x the size of the original data. |
| `scale` < 1 | Shrink the data by the specified percentage. For example, `scale=0.9` will create synthetic data that is roughly 90% the size of the original data. |
Returns A dictionary that maps each table name (string) to a pandas DataFrame object with synthetic data for that table. The synthetic data mimics the real data.
How large will the synthetic data be? The scale is based on the size of the data you used for training. The scale determines the size of every parent table (i.e., a table without any foreign keys).
Note that the synthesizer will algorithmically determine the size of the child tables, so their final sizes will approximately follow the scale, with some minor deviations.
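The sizing behavior can be sketched numerically. This is an illustration under assumed table names and counts, not the actual SDV algorithm:

```python
import numpy as np

# Sketch of how the scale factor maps to output sizes (illustrative
# only, not the SDV internals). Table names and counts are hypothetical.
rng = np.random.default_rng(0)

original_sizes = {'users': 100, 'transactions': 430}
scale = 0.5

# Parent tables (no foreign keys) are scaled directly by the factor.
n_users = round(scale * original_sizes['users'])

# Child table sizes are determined algorithmically: each synthetic
# parent draws a child count (here approximated with a Poisson around
# the observed average of 4.3 transactions per user), so the total
# only approximately follows the scale.
avg_children = original_sizes['transactions'] / original_sizes['users']
n_transactions = int(rng.poisson(avg_children, size=n_users).sum())

print(n_users)         # → 50 (exactly scale * 100)
print(n_transactions)  # close to 215, with minor deviation
```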
reset_sampling
Use this function to reset any randomization in sampling. After calling this, your synthesizer will generate the same data as before. For example, if you sample, reset, and then sample again with the same arguments, the two results (synthetic_data1 and synthetic_data2) will be identical.
Parameters None
Returns None. Resets the synthesizer.
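The reset behavior can be illustrated with a minimal seeded stand-in sampler (a sketch, not the real SDV synthesizer):

```python
import numpy as np

# Minimal stand-in for a synthesizer (illustrative only): resetting
# restores the random generator to its initial state, so the same
# draws are replayed on the next sample call.
class TinySampler:
    def __init__(self, seed=42):
        self._seed = seed
        self.reset_sampling()

    def reset_sampling(self):
        # Re-create the generator from the original seed
        self._rng = np.random.default_rng(self._seed)

    def sample(self, n):
        return self._rng.normal(size=n)

sampler = TinySampler()
synthetic_data1 = sampler.sample(5)
sampler.reset_sampling()
synthetic_data2 = sampler.sample(5)

print((synthetic_data1 == synthetic_data2).all())  # → True
```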
Save Your Data Locally
Save your synthetic data back into its original format on your local machine.
save_csvs
Use this function to save your synthetic data locally into CSV files. Each table will be written to a separate CSV file.
Parameters

(required) data
: A dictionary mapping each table name to a pandas DataFrame containing the synthetic data

(required) folder_name
: The name of the folder where you'd like to write the synthetic data. All CSV files will be written in this folder.

suffix
: An optional string suffix to add to each CSV file name

(default) If there is no suffix, each table will be saved as <table_name>.csv. Supply any other string to add a suffix; it will be added before .csv, e.g. a suffix of '-synthetic' will create files like '<table_name>-synthetic.csv'.

to_csv_parameters
: A dictionary with additional parameters to pass in when saving the CSVs. The keys are any of the parameters of pandas.to_csv and the values are their inputs.

Returns None. All the tables will be written as CSVs inside the folder you specified.
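The file-naming behavior described above can be sketched with plain pandas (a hypothetical re-implementation for illustration, not the SDV source):

```python
import os
import tempfile
import pandas as pd

# Hypothetical sketch of the save_csvs behavior: each table is written
# to <table_name><suffix>.csv inside the given folder.
def save_csvs_sketch(data, folder_name, suffix='', to_csv_parameters=None):
    to_csv_parameters = to_csv_parameters or {}
    os.makedirs(folder_name, exist_ok=True)
    for table_name, df in data.items():
        path = os.path.join(folder_name, f'{table_name}{suffix}.csv')
        df.to_csv(path, index=False, **to_csv_parameters)

synthetic_data = {
    'users': pd.DataFrame({'user_id': [1, 2], 'name': ['A', 'B']}),
    'transactions': pd.DataFrame({'txn_id': [10], 'user_id': [1]}),
}
with tempfile.TemporaryDirectory() as folder:
    save_csvs_sketch(synthetic_data, folder, suffix='-synthetic')
    files = sorted(os.listdir(folder))

print(files)  # → ['transactions-synthetic.csv', 'users-synthetic.csv']
```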
Save Your Data to a Database
If your original data came from a database, you can also save your synthetic data into a new database of the same format. We currently support Google's BigQuery databases. Other database types are coming soon!
*SDV Enterprise Feature. This feature is only available for licensed, enterprise users. To learn more about the SDV Enterprise features and purchasing a license, visit our website.
This functionality is in Beta! Beta functionality may have bugs and may change in the future. Help us out by testing this functionality and letting us know if you encounter any issues.
set_export_config
Use this function to specify which project and dataset you'd like to export your data to. Also provide your authentication credentials.
Use the same connector from your import. For more details, see the BigQueryConnector import docs.
Parameters

(required) project_id
: A string with the name of your project in BigQuery

(required) dataset_id
: A string with the name of your dataset in BigQuery

auth
: A dictionary with your authentication credentials.

(default) None
: Use the auth credentials from your environment
How do you pass auth credentials? The recommended approach is to download a JSON file from BigQuery and pass in the filepath. To generate the JSON file, see the BigQuery docs.
Alternatively, you can provide this information directly as a dictionary.
Which permissions are needed for exporting? Exporting data requires write access. For BigQuery, this includes: bigquery.datasets.create, bigquery.datasets.get, bigquery.jobs.create, bigquery.tables.create, and bigquery.tables.export. If you do not have these permissions, please contact your database admin.
Output (None)
export
Use this function to export your synthetic data into a database.
Parameters

(required) synthetic_data
: A dictionary that maps each table name to the synthetic data, represented as a pandas DataFrame

(required) metadata
: A MultiTableMetadata object that describes the data

mode
: The mode of writing to use during the export

(default) 'write'
: Write a new database from scratch. If the database or data already exists, then the function will error out.

More writing modes will be coming soon!
Output (None) Your data will be written to the database and ready for use by your downstream application!
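Since the connector API is an SDV Enterprise feature, the snippet below only illustrates the call sequence with a stand-in class. The real BigQueryConnector and its import path come from the import docs; the project name, auth format, and metadata value here are assumptions:

```python
import pandas as pd

# Stand-in connector (illustrative only) showing the configure-then-export
# sequence described above. 'write' is currently the only supported mode.
class FakeBigQueryConnector:
    def set_export_config(self, project_id, dataset_id, auth=None):
        # auth=None falls back to the credentials in your environment
        self.config = {'project_id': project_id,
                       'dataset_id': dataset_id,
                       'auth': auth}

    def export(self, synthetic_data, metadata, mode='write'):
        if mode != 'write':
            raise ValueError("Only the 'write' mode is currently supported.")
        # A real connector would create the dataset and write each table;
        # here we just record the row counts.
        self.exported = {name: len(df) for name, df in synthetic_data.items()}

connector = FakeBigQueryConnector()
connector.set_export_config(
    project_id='my-project',                 # hypothetical project name
    dataset_id='my_dataset',                 # hypothetical dataset name
    auth={'filepath': 'credentials.json'},   # hypothetical auth format
)
connector.export(
    synthetic_data={'users': pd.DataFrame({'user_id': [1, 2, 3]})},
    metadata=None,  # stand-in for a MultiTableMetadata object
)
print(connector.exported)  # → {'users': 3}
```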