MSSQL

If your data is available in a Microsoft SQL database, you can directly connect to it in order to extract the data and metadata. Later, you can use the same connections to write the data back to a database.

This functionality is in Beta! Beta functionality may have bugs and may change in the future. Help us out by testing this functionality and letting us know if you encounter any issues.

Installation

SDV Enterprise Feature. This feature is only available for licensed, enterprise users. To learn more about the SDV Enterprise features and purchasing a license, visit our website.

To connect to the database, please make sure you have an ODBC driver for SQL Server installed on your machine. For more information, see the instructions from Microsoft for Windows, macOS, and Linux.

Please also make sure you have installed SDV Enterprise with the optional db-mssql dependency. For more information, see the SDV Enterprise Installation guide.

pip install sdv_enterprise[db-mssql] --index-url https://pypi.datacebo.com --timeout 600

* MSSQLConnector

Use this connector to create a connection to your database.

from sdv.io import MSSQLConnector

connector = MSSQLConnector()

Parameters (None)

Output A connector object that you can use to import data and metadata

Importing Real Data

Import your real data from a database to use with SDV.

* set_import_config

Use this function to authenticate into the project and database you'd like to import from.

connector.set_import_config(
    schema='my_schema_name',
    auth={
        ...
    }
)

Parameters

  • (required) schema: A string with the name of your schema

  • auth: A dictionary with your authentication credentials.

    • (default) None: Use the auth credentials from your environment

How do you pass auth credentials? The recommended approach is to provide a dictionary with your credentials.

auth={
    'username': 'my_username',
    'password': 'my_password',
    'database': 'my_database_name',
    'server': 'my_datasets.my_database.windows.net',
    'port': 1234, # optional
    'encrypt': 'yes', # optional
    'trust_server_certificate': 'no', # optional
    'connection_timeout': 30, # optional
    'max_pool_size': 100, # optional
}
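Hardcoding credentials in source code is risky. One pattern is to assemble the auth dictionary from environment variables instead. The sketch below assumes hypothetical `MSSQL_*` variable names, which are not part of the SDV API; only the resulting dictionary keys matter to the connector.

```python
import os

def auth_from_env():
    """Build the auth dictionary from environment variables.

    The MSSQL_* variable names are an illustrative convention for this
    example; set them however your deployment manages secrets.
    """
    auth = {
        'username': os.environ['MSSQL_USERNAME'],
        'password': os.environ['MSSQL_PASSWORD'],
        'database': os.environ['MSSQL_DATABASE'],
        'server': os.environ['MSSQL_SERVER'],
    }
    # Optional settings are added only when the variable is present
    if 'MSSQL_PORT' in os.environ:
        auth['port'] = int(os.environ['MSSQL_PORT'])
    return auth
```

You could then call `connector.set_import_config(schema='my_schema_name', auth=auth_from_env())` without any credentials appearing in your code.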

Which permissions are needed for importing? You will need to have the db_datareader role for importing data. For more information, see the Microsoft documentation.

Output (None)

* create_metadata

Use this function to create metadata based on the connection to your database.

metadata = connector.create_metadata(
    table_names=['users', 'transactions', 'sessions'])

Parameters

  • table_names: A list of strings representing the table names that you want to create metadata for

    • (default) None: Create metadata for all the tables in the database

  • infer_sdtypes: A boolean describing whether to infer the sdtypes of each column

    • (default) True: Infer the sdtypes based on the data

    • False: Do not infer the sdtypes. All columns will be marked as unknown, ready for you to manually update

  • infer_primary_keys: A boolean describing whether to infer the primary keys in each table

    • (default) True: Infer the primary keys based on the data

    • False: Do not infer primary keys. None of the tables will have primary keys.

  • infer_foreign_keys: A boolean describing whether to infer the foreign keys in each table

    • (default) True: Infer the foreign keys based on the data. Each table may have 0 or more foreign keys.

    • False: Do not infer foreign keys. None of the tables will have foreign keys.

Output A MultiTableMetadata object representing your metadata

The detected metadata is not guaranteed to be accurate or complete. Be sure to carefully inspect the metadata and update information. For more information, see the Metadata Inspection and Update API.
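As a sketch of that inspection step: the table and column names below are illustrative, and the `to_dict`, `update_column`, and `validate` methods follow the open-source SDV metadata API.

```python
def review_metadata(metadata):
    """Inspect auto-detected metadata and correct one column's sdtype.

    The 'users'/'age' names are hypothetical examples; substitute the
    tables and columns from your own schema.
    """
    # Print the full detected structure: tables, columns, sdtypes, keys
    print(metadata.to_dict())

    # Fix a column whose sdtype was detected incorrectly
    metadata.update_column(
        table_name='users',
        column_name='age',
        sdtype='numerical',
    )

    # Raises an error if the updated metadata is inconsistent
    metadata.validate()
```

This keeps all manual corrections in one place, so they can be re-applied whenever the metadata is re-detected.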

* import_random_subset

Use this function to import a random subset of your data from the database, using the inferred metadata. The size of the subset is automatically determined.

data = connector.import_random_subset(
    metadata=metadata,
    verbose=True
)

Parameters

  • (required) metadata: A MultiTableMetadata object that describes the data you want to import

  • random_state: An integer that represents the random seed

    • (default) None: Different random data will be imported every time you call the function

    • <integer>: Any time you call the function with this integer, the same random data will be imported

  • verbose: A boolean describing whether to print out details about the progress of the import

    • (default) True: Print out the table names and number of rows being imported

    • False: Do not print any details

Output A dictionary that maps each table name of your database (string) to the data, represented as a pandas DataFrame.
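Because the output is a plain dictionary of pandas DataFrames, you can inspect it with standard pandas calls before fitting a synthesizer. A minimal sketch (the `summarize_tables` helper is not part of the SDV API):

```python
import pandas as pd

def summarize_tables(data):
    """Return a one-row-per-table summary of an imported data dictionary."""
    return pd.DataFrame([
        {'table': name, 'rows': len(df), 'columns': df.shape[1]}
        for name, df in data.items()
    ])
```

For example, `summarize_tables(data)` gives you a quick check that the imported row counts look reasonable for each table.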

* import_optimized_subset

Use this function to import a subset of your data, optimized specifically for a given table. You can also control the size.

data = connector.import_optimized_subset(
    metadata=metadata,
    main_table_name='users',
    num_rows=5000
)

Parameters

  • (required) metadata: A MultiTableMetadata object that describes the data you want to import

  • (required) main_table_name: A string containing the name of the most important table of your database. This table will generally represent the entity that is most critical to your application or business. It must be one of the tables listed in your metadata object.

  • num_rows: The number of rows to sample from the main table. The size of every other table is automatically determined by its connection to the main table.

  • random_state: An integer that represents the random seed

    • (default) None: Different random data will be imported every time you call the function

    • <integer>: Any time you call the function with this integer, the same random data will be imported

  • verbose: A boolean describing whether to print out details about the progress of the import

    • (default) True: Print out the table names and number of rows being imported

    • False: Do not print any details

Output A dictionary that maps each table name of your database (string) to the data, represented as a pandas DataFrame.

After importing the data and metadata, you are now ready to create an SDV synthesizer.

Exporting synthetic data

Export synthetic data into a new database.

We recommend using the same connector as your import. This connector object already knows about the specifics of your database schema. It will ensure that the exported data schema has the same format.

* set_export_config

Use this function to specify which project and database you'd like to export data to. Also provide your authentication credentials.

connector.set_export_config(
    schema='my_schema_name',
    auth={
        ...
    }
)

Parameters

  • (required) schema: A string with the name of your schema

  • auth: A dictionary with your authentication credentials.

    • (default) None: Use the auth credentials from your environment

How do you pass auth credentials? The recommended approach is to provide a dictionary with your credentials.

auth={
    'username': 'my_username',
    'password': 'my_password',
    'database': 'my_database_name',
    'server': 'my_datasets.my_database.windows.net',
    'port': 1234, # optional
    'encrypt': 'yes', # optional
    'trust_server_certificate': 'no', # optional
    'connection_timeout': 30, # optional
    'max_pool_size': 100, # optional
}

Which permissions are needed for exporting? You will need to have the db_datawriter role for exporting data. If you'd like to overwrite data, then you'd need additional permissions given by the db_ddladmin, db_owner, or sysadmin roles. For more information, see the Microsoft documentation.

Output (None)

* export

Use this function to export your synthetic data into a database.

connector.export(
    synthetic_data=synthetic_data,
    metadata=metadata,
    mode='write',
    verbose=True)

Parameters

  • (required) synthetic_data: A dictionary that maps each table name to the synthetic data, represented as a pandas DataFrame

  • (required) metadata: A MultiTableMetadata object that describes the data

  • mode: The mode of writing to use during the export

    • (default) 'write': Write a new database from scratch. If the database or data already exists, then the function will error out.

    • 'append': Append rows to existing tables in the database

    • 'overwrite': Remove any existing tables in the database and replace them with this data

  • verbose: A boolean describing whether to print out details about export

    • (default) True: Print the details

    • False: Do not print anything

Output (None) Your data will be written to the database and ready for use by your downstream application!

Copyright (c) 2023, DataCebo, Inc.