Cleaning Your Data

Use the utility functions below to clean your multi-table data for fast and effective modeling.

drop_unknown_references

Multi-table SDV synthesizers work best when your dataset has referential integrity, meaning that all the references in a foreign key refer to an existing value in the primary key. Use this function to drop rows that contain unknown references for your synthesizer.
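To make this concrete, here is a minimal sketch of an unknown reference. The table and column names (users, sessions, user_id) are hypothetical; any foreign key value with no matching primary key value counts as an unknown reference.

import pandas as pd

# Hypothetical parent and child tables; the names are illustrative only
users = pd.DataFrame({'user_id': [1, 2, 3]})
sessions = pd.DataFrame({
    'session_id': [10, 11, 12],
    'user_id': [1, 2, 99],  # 99 has no match in users.user_id: an unknown reference
})

data = {'users': users, 'sessions': sessions}
# drop_unknown_references would remove the third sessions row to restore integrity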

Parameters

  • (required) metadata: A MultiTableMetadata object

  • (required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.

  • drop_missing_values: A boolean that describes whether to drop missing values in the foreign key

    • (default) True: If a foreign key has a missing value, treat it as an unknown reference and drop it. We recommend this setting for maximum efficiency with SDV.

    • False: If a foreign key has a missing value, treat it as a valid reference and keep it

  • verbose: A boolean that controls whether to print out a summary of the results

    • (default) True: Print a summary of the number of rows that are dropped from each table

    • False: Do not print anything out

Output

A dictionary that maps each table name to a pandas DataFrame containing data. The data will have referential integrity, meaning that there will be no unknown foreign key references.

from sdv.utils import drop_unknown_references

cleaned_data = drop_unknown_references(data, metadata)
Success! All foreign keys have referential integrity. 

Table Name    # Rows (Original)    # Invalid Rows   # Rows (New)
sessions      1200                 50               1150     
transactions  5000                 0                5000
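If you want to keep rows whose foreign key is missing, or silence the printed summary, you can pass the parameters described above. A minimal sketch using keyword arguments:

from sdv.utils import drop_unknown_references

# keep rows with missing foreign key values and skip the printed summary
cleaned_data = drop_unknown_references(
    data=data,
    metadata=metadata,
    drop_missing_values=False,
    verbose=False
)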

simplify_schema

Some synthesizers, such as HMA, are not designed to work with a large number of tables. This function reduces your schema to a minimal set by dropping some tables and columns, allowing you to complete your proof-of-concept using public SDV.

After completing your proof-of-concept, you can reach out to us to inquire about our paid SDV plans. SDV Enterprise supports many more tables, so you will not need to use simplify_schema on the paid plan.

Parameters

  • (required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.

  • (required) metadata: A MultiTableMetadata object that describes your data.

Output

  • simplified_data: A dictionary that maps a table name to a pandas DataFrame containing the data. The simplified data schema may have fewer tables and columns than the original.

  • simplified_metadata: A MultiTableMetadata object that describes the simplified data

from sdv.utils import poc

simplified_data, simplified_metadata = poc.simplify_schema(data, metadata)
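After simplifying, you can continue your proof-of-concept as usual. The sketch below assumes the standard multi-table workflow with the HMA synthesizer; adjust it to whichever synthesizer you are evaluating.

from sdv.multi_table import HMASynthesizer

# fit a synthesizer on the simplified schema produced above
synthesizer = HMASynthesizer(simplified_metadata)
synthesizer.fit(simplified_data)

synthetic_data = synthesizer.sample(scale=1)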

get_random_subset

Use this function to subsample data from your dataset. This function will keep the same overall schema (and columns) but it will reduce the number of rows in each table.

from sdv.utils import poc

subsampled_data = poc.get_random_subset(
    data, 
    metadata,
    main_table_name="users", 
    num_rows=100
)

Parameters

  • (required) data: Your full dataset. A dictionary that maps a table name to a pandas DataFrame containing the data

  • (required) metadata: A MultiTableMetadata object that describes the data

  • (required) main_table_name: A string with the name of the most important table in your dataset. We'll make sure the subsample is optimized for this main table.

  • (required) num_rows: The number of rows to subsample from the main table. All other tables' sizes will be determined algorithmically based on this.

  • verbose: Whether to print out the results

    • (default) True: Print out how many rows were included from each table

    • False: Do not print anything out

Output

A dataset with fewer rows than before. The dataset will continue to have referential integrity, meaning that there will be no invalid or missing references between the tables.
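As a sanity check, you can confirm the new table sizes and spot-check referential integrity yourself. A minimal sketch, assuming the hypothetical users/sessions tables and user_id foreign key from the earlier example:

# print the number of rows remaining in each table
for table_name, table in subsampled_data.items():
    print(f'{table_name}: {len(table)} rows')

# every remaining session should still point to a user in the subsample
assert subsampled_data['sessions']['user_id'].isin(
    subsampled_data['users']['user_id']
).all()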
