❖ AlloyDB


If your data is available in an AlloyDB database, you can directly connect to it in order to extract the data and metadata. Later, you can use the same connections to write the data back to a database.

This functionality is in Beta! Beta functionality may have bugs and may change in the future. Help us out by testing this functionality and letting us know if you encounter any issues.

Installation

To use this feature, please make sure you have installed the bundle with the optional db-alloydb dependency. For more information, see the SDV Enterprise Installation guide.

pip install -U bundle-ai-connectors[db-alloydb] --index-url https://pypi.datacebo.com

Use this connector to create a connection to your AlloyDB database.

from sdv.io.database import AlloyDBConnector

connector = AlloyDBConnector()

Parameters (None)

Output An AlloyDBConnector object that you can use to import data and metadata

Importing Real Data

Import your real data from a database to use with SDV.

❖ set_import_config

Use this function to authenticate into the project and database you'd like to import from.

connector.set_import_config(
    auth={
        ...
    },
    schema_name='public'
)

Parameters

  • (required) auth: A dictionary with your authentication credentials (for details, see the FAQs)

  • schema_name: A string with the name of your schema. Please add this if your database tables are in a named schema.

    • (default) None: You can omit this parameter if you have not set up a schema name. In this case, the connector will use whatever schema name your database assigns as a default. For AlloyDB, this is 'public'.

Output (None)

❖ create_metadata

Use this function to create metadata based on the connection to your database.

metadata = connector.create_metadata(
    table_names=['users', 'transactions', 'sessions'])

Parameters

  • table_names: A list of strings representing the table names that you want to create metadata for

    • (default) None: Create metadata for all the tables in the database

  • infer_sdtypes: A boolean describing whether to infer the sdtypes of each column

    • (default) True: Infer the sdtypes based on the data

    • False: Do not infer the sdtypes. All columns will be marked as unknown, ready for you to manually update

  • infer_keys: A string describing whether to infer the primary and/or foreign keys. Options are:

    • (default) 'primary_and_foreign': Infer the primary keys in each table, and the foreign keys in other tables that refer to them

    • 'primary_only': Infer only the primary keys of each table

    • None: Do not infer any keys

Output A Metadata object representing your metadata

The detected metadata is not guaranteed to be accurate or complete. Be sure to carefully inspect the metadata and update the information as needed. For more information, see the Metadata API.
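Since the detected metadata often needs manual corrections, it can help to validate and review it before modeling. Below is a minimal sketch, assuming the returned object follows SDV's standard Metadata API; the 'users' table and 'age' column are placeholder names.

metadata.validate()        # raises an error if the metadata is internally inconsistent
print(metadata.to_dict())  # review the detected tables, columns and keys

# Example correction (placeholder names): fix a column whose sdtype was detected incorrectly
metadata.update_column(
    table_name='users',
    column_name='age',
    sdtype='numerical'
)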

❖ import_random_subset

Use this function to import a random subset of your data from the database, based on your metadata. The size of the subset is automatically determined.

data = connector.import_random_subset(
    metadata=metadata,
    verbose=True
)

Parameters

  • (required) metadata: A Metadata object that describes the data you want to import

  • fixed_seed: A boolean that controls the determinism of the random subset

    • (default) False: Different random data will be imported every time you call the function

    • True: The same data will be imported every time you import

  • verbose: A boolean describing whether to print out details about the progress of the import

    • (default) True: Print out the table names and number of rows being imported

    • False: Do not print any details

Output A dictionary that maps each table name of your database (string) to the data, represented as a pandas DataFrame.
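For example, a quick way to see what was imported (an optional sanity check, not a required step):

# Each key is a table name; each value is a pandas DataFrame
for table_name, table in data.items():
    print(f"{table_name}: {len(table)} rows, {len(table.columns)} columns")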

❖ import_optimized_subset

Use this function to import a subset of your data, optimized specifically for a given table. You can also control the size.

data = connector.import_optimized_subset(
    metadata=metadata, 
    main_table_name='users',
    num_rows=5000
)

Parameters

  • (required) metadata: A Metadata object that describes the data you want to import

  • (required) main_table_name: A string containing the name of the most important table of your database. This table will generally represent the entity that is most critical to your application or business. It must be one of the tables listed in your metadata object.

  • num_rows: The number of rows to sample from the main table. The size of every other table is automatically determined by its connection to the main table.

  • fixed_seed: A boolean that controls the determinism of the random subset

    • (default) False: Different random data will be imported every time you call the function

    • True: The same data will be imported every time you import

  • verbose: A boolean describing whether to print out details about the progress of the import

    • (default) True: Print out the table names and number of rows being imported

    • False: Do not print any details

Output A dictionary that maps each table name of your database (string) to the data, represented as a pandas DataFrame.

After importing the data and metadata, you are now ready to create an SDV synthesizer.
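For example, here is a minimal sketch using the HMASynthesizer; any SDV multi-table synthesizer follows the same fit/sample pattern.

# Fit a multi-table synthesizer on the imported data and metadata,
# then generate synthetic data to export later.
from sdv.multi_table import HMASynthesizer

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(scale=1.0)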

Exporting synthetic data

Export synthetic data into a new database.

We recommend using the same connector as your import. This connector object already knows about the specifics of your database schema. It will ensure that the exported data schema has the same format.

❖ set_export_config

Use this function to specify which project and database you'd like to export data to. Also provide your authentication credentials.

connector.set_export_config(
    auth={
        ...
    },
    schema_name='public'
)

Parameters

  • (required) auth: A dictionary with your authentication credentials (for more information about authentication, see the FAQs)

  • schema_name: A string with the name of your schema. Please add this if your database tables are in a named schema.

    • (default) None: You can omit this parameter if you have not set up a schema name. In this case, the connector will use whatever schema name your database assigns as a default. For AlloyDB, this is 'public'.

Output (None)

❖ export

Use this function to export your synthetic data into a database.

connector.export(
    data=synthetic_data,
    mode='write',
    verbose=True)

Parameters

  • (required) data: A dictionary that maps each table name to the synthetic data, represented as a pandas DataFrame

  • mode: The mode of writing to use during the export

    • (default) 'write': Write a new database from scratch. If the database or data already exists, then the function will error out.

    • 'append': Append rows to existing tables in the database

    • 'overwrite': Remove any existing tables in the database and replace them with this data

  • verbose: A boolean describing whether to print out details about export

    • (default) True: Print the details

    • False: Do not print anything

Output (None) Your data will be written to the database and ready for use by your downstream application!
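For example, if you later generate additional synthetic rows for tables that already exist in the destination database, a hypothetical follow-up call could use append mode:

# Hypothetical follow-up run: add more rows to the existing tables
more_synthetic_data = synthesizer.sample(scale=0.5)
connector.export(
    data=more_synthetic_data,
    mode='append',
    verbose=True)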

FAQs

Authentication & Credentials

Before using your connector, you will need to authenticate into the database. To do this, start by determining whether your account is a user account or a service account. Service accounts are typically made for non-human users. (For more details, see the docs.)

Authenticating a user account

If you are connecting a user account, we recommend using the Application Default Credentials (ADC).

Prerequisites: Run through these commands on your terminal (command line) before using Python.

  1. To get started, install the Google CLI using these instructions. Then run the following commands on your terminal:

gcloud init
gcloud auth application-default login

This will prompt you to sign in, and then save the credentials onto your local machine in a pre-determined location (see the docs).

  2. Next, run the following command in your terminal to generate a password to use for authentication. Save this password.

gcloud auth print-access-token

  3. Finally, run the following command to ensure you have access to the database. Fill in each of the sections with your information.

gcloud alloydb users create <email> --cluster=<cluster> --region=<region> --type=IAM_BASED

Authentication in Python: Now you can use the Python library. The AI Connector will automatically be able to find your credentials file and authenticate you. Provide the following information when setting up your Python connector:

connector.set_import_config(
    schema_name='public',
    auth={
        'database': 'value', # required
        'project_id': 'value', # required
        'region': 'value', # required
        'cluster_id': 'value', # required
        'instance_id': 'value', # required
        'ip_type': 'PUBLIC', # required: either 'PUBLIC' or 'PRIVATE'
        'user': 'youremail@company.com', # required, your email used for authentication
        'password': 'your-access-token', # required, from step #2
    }
)
Authenticating a service account

If you are connecting a service account, start by downloading a JSON credential file for the service account onto your machine. Follow the instructions in these docs to select your service account, add a key, and download the JSON file. (If you need help with this step, please contact your database admin.) Then follow the recommended approach below.

Recommended approach: When authenticating your AI connector, pass in the JSON filepath along with other necessary info, as shown below.

connector.set_import_config(
    auth={
        'json_credentials_path': 'my_folder/credentials.json', # pass in the JSON filepath
        'database': 'value', # required
        'project_id': 'value', # required
        'region': 'value', # required
        'cluster_id': 'value', # required
        'instance_id': 'value', # required
    },
    schema_name='public'
)

Alternative approach (environment variables): Alternatively, you can save the JSON filepath and project ID as environment variables by typing the terminal commands below.

export GOOGLE_APPLICATION_CREDENTIALS=my_folder/credentials.json
export GOOGLE_CLOUD_PROJECT=my_project_id

Once you've run the above commands, your AI Connector can automatically locate the JSON credentials file and pull the auth credentials from it. You no longer have to specify this information in the auth dictionary.

connector.set_import_config(
    auth={
        'database': 'value', # required
        'project_id': 'value', # required
        'region': 'value', # required
        'cluster_id': 'value', # required
        'instance_id': 'value', # required
    },
    schema_name='public'
)
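If you prefer to set these values from Python rather than your shell, a hypothetical equivalent is to set the same environment variables before the connector authenticates (this is a convenience, not an official step of the connector):

import os

# Must be set before the connector authenticates
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'my_folder/credentials.json'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'my_project_id'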

If you are using a Google Cloud API, you will also need the Service Usage Consumer permissions (roles/serviceusage.serviceUsageConsumer). To login with a service account, you will need to enable IAM authentication with the flag alloydb.iam_authentication (this is not enabled by default).

Other FAQs

Which permissions are needed for importing and exporting?

Importing data requires read access. Your database admin should run the following command to grant you access.

GRANT SELECT ON <schema_name | table_name> TO <user_name | role_name>;

Exporting data requires write access. Your database admin should run the following command to grant you access.

GRANT CREATE, UPDATE, DELETE, INSERT ON <schema_name | table_name> TO <user_name | role_name>;

Please note that your Python client will also need the base permissions for roles/alloydb.client and roles/alloydb.viewer. These roles grant the following permissions: alloydb.clusters.generateClientCertificate, alloydb.clusters.get, alloydb.instances.connect, alloydb.instances.get, alloydb.users.login, and alloydb.instances.executeSql.

For more information, see the AlloyDB documentation.


❖ SDV Enterprise Bundle. This feature is available as part of the AI Connectors Bundle, an optional add-on to SDV Enterprise. For more information, please visit the AI Connectors Bundle page.