Cleaning Your Data

Use the utility functions below to clean your multi-table data for fast and effective multi-table modeling.

drop_unknown_references

Multi-table SDV synthesizers work best when your dataset has referential integrity, meaning that all the references in a foreign key refer to an existing value in the primary key. Use this function to drop rows that contain unknown references for your proof-of-concept synthesizer.

Parameters

  • (required) metadata: A MultiTableMetadata object

  • (required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.

  • drop_missing_values: A boolean that describes whether to drop missing values in the foreign key

    • (default) True: If a foreign key has a missing value, treat it as an unknown reference and drop the row. We recommend this setting for maximum efficiency with SDV.

    • False: If a foreign key has a missing value, treat it as a valid reference and keep the row

  • verbose: A boolean that controls whether to print out a summary of the results

    • (default) True: Print a summary of the number of rows that are dropped from each table

    • False: Do not print a summary

Output: A dictionary that maps each table name to a pandas DataFrame containing data. The returned data has referential integrity, meaning that it contains no unknown foreign key references.

from sdv.utils import poc

cleaned_data = poc.drop_unknown_references(metadata, data)
Success! All foreign keys have referential integrity. 

Table Name    # Rows (Original)    # Invalid Rows   # Rows (New)
sessions      1200                 50               1150     
transactions  5000                 0                5000
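
To make the idea of an unknown reference concrete, here is a minimal pandas sketch of the same kind of check on a single parent/child pair. It is an illustration only and not SDV's implementation: the helper name drop_orphan_rows, the table names, and the column names are all hypothetical.

```python
import pandas as pd


def drop_orphan_rows(parent, child, primary_key, foreign_key, drop_missing_values=True):
    """Return the child rows whose foreign key points at an existing primary key.

    Hypothetical helper for illustration; not part of SDV.
    """
    # True where the foreign key value exists in the parent's primary key.
    known = child[foreign_key].isin(parent[primary_key])
    if not drop_missing_values:
        # Treat a missing foreign key as a valid reference and keep the row.
        known = known | child[foreign_key].isna()
    return child[known].reset_index(drop=True)


users = pd.DataFrame({'user_id': ['u1', 'u2', 'u3']})
sessions = pd.DataFrame({
    'session_id': [10, 11, 12, 13],
    'user_id': ['u1', 'u4', None, 'u3'],  # 'u4' is unknown, None is missing
})

strict = drop_orphan_rows(users, sessions, 'user_id', 'user_id')
lenient = drop_orphan_rows(users, sessions, 'user_id', 'user_id', drop_missing_values=False)
```

In this sketch, strict keeps only the sessions that reference 'u1' and 'u3', while lenient also keeps the session whose user_id is missing.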

simplify_schema

By default, some synthesizers, such as HMA, are not designed to work with a large number of tables. This function reduces your schema to a minimal set by dropping some tables and columns, so that you can complete your proof-of-concept using public SDV.

After completing your proof-of-concept, you can reach out to us to inquire about our paid SDV plans. SDV Enterprise supports many more tables, so you will not have to use simplify_schema on the paid plan.

Parameters

  • (required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.

  • (required) metadata: A MultiTableMetadata object that describes your data.

Output:

  • simplified_data: A dictionary that maps each table name to a pandas DataFrame containing the data. The simplified data schema may have fewer tables and columns than the original.

  • simplified_metadata: A MultiTableMetadata object that describes the simplified data

from sdv.utils import poc

simplified_data, simplified_metadata = poc.simplify_schema(data, metadata)
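
After simplifying, it can help to check which tables and columns were dropped. The sketch below compares two data dictionaries using plain pandas. The helper name schema_diff is hypothetical, and the small tables are stand-ins for the original and simplified data; it assumes every table in the simplified dictionary also exists in the original.

```python
import pandas as pd


def schema_diff(original, simplified):
    """List the tables and columns that are in `original` but not in `simplified`.

    Hypothetical helper for illustration; not part of SDV.
    Assumes every table name in `simplified` also appears in `original`.
    """
    dropped_tables = sorted(set(original) - set(simplified))
    dropped_columns = {}
    for name, table in simplified.items():
        missing = [col for col in original[name].columns if col not in table.columns]
        if missing:
            dropped_columns[name] = missing
    return dropped_tables, dropped_columns


# Stand-in data; in practice, pass `data` and `simplified_data`.
original = {
    'users': pd.DataFrame({'user_id': [1, 2], 'bio': ['a', 'b']}),
    'sessions': pd.DataFrame({'session_id': [10], 'user_id': [1]}),
    'logs': pd.DataFrame({'log_id': [100]}),
}
simplified = {
    'users': original['users'][['user_id']],
    'sessions': original['sessions'],
}

dropped_tables, dropped_columns = schema_diff(original, simplified)
```

Here dropped_tables lists the tables that were removed entirely, and dropped_columns maps each remaining table to the columns it lost.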

Copyright (c) 2023, DataCebo, Inc.