Cleaning Your Data
Use the utility functions below to clean your multi-table data for fast and effective multi-table modeling.
drop_unknown_references
Multi-table SDV synthesizers work best when your dataset has referential integrity, meaning that all the references in a foreign key refer to an existing value in the primary key. Use this function to drop rows that contain unknown references for your proof-of-concept synthesizer.
Parameters
(required) metadata: A MultiTableMetadata object
(required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.
drop_missing_values: A boolean that describes whether to drop missing values in the foreign key
  (default) True: If a foreign key has a missing value, treat it as an unknown reference and drop it. We recommend this setting for maximum efficiency with SDV.
  False: If a foreign key has a missing value, treat it as a valid reference and keep it
verbose: A boolean that controls whether to print out a summary of the results
  (default) True: Print a summary of the number of rows that are dropped from each table
Output: A dictionary that maps each table name to a pandas DataFrame containing data. The returned data has referential integrity, meaning that there are no unknown foreign key references.
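To make the idea of an unknown reference concrete, the sketch below filters a single parent/child relationship with plain pandas. It is an illustration only, not SDV's implementation: the helper name drop_orphan_rows and the example tables are hypothetical, and it handles one foreign key rather than a full multi-table schema.

```python
import pandas as pd


def drop_orphan_rows(parent, child, primary_key, foreign_key, drop_missing_values=True):
    """Hypothetical helper: keep child rows whose foreign key exists in the parent."""
    known = child[foreign_key].isin(parent[primary_key])
    if not drop_missing_values:
        # Treat a missing foreign key as a valid reference and keep the row.
        known = known | child[foreign_key].isna()
    return child[known].reset_index(drop=True)


# Made-up example tables (assumption, not from SDV).
users = pd.DataFrame({"user_id": [1, 2, 3]})
sessions = pd.DataFrame({
    "session_id": [10, 11, 12, 13],
    "user_id": [1, 4, None, 3],  # 4 is an unknown reference, None is missing
})

strict = drop_orphan_rows(users, sessions, "user_id", "user_id")
lenient = drop_orphan_rows(
    users, sessions, "user_id", "user_id", drop_missing_values=False
)

print(strict["session_id"].tolist())
print(lenient["session_id"].tolist())
```

The strict call mirrors the default behavior described above, where a missing foreign key counts as an unknown reference. The lenient call keeps rows with a missing foreign key and drops only the row that points at a user that does not exist.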
simplify_schema
By default, some synthesizers, like HMA, are not designed to work with a large number of tables. This function reduces your schema to a minimal set by dropping some tables and columns, which allows you to complete your proof-of-concept using public SDV.
After completing your proof-of-concept, you can reach out to us to inquire about our paid SDV plans. SDV Enterprise supports many more tables, so you will not have to use simplify_schema on the paid plan.
Parameters
(required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.
(required) metadata: A MultiTableMetadata object that describes your data.
Output:
simplified_data: A dictionary that maps each table name to a pandas DataFrame containing the data. The simplified data schema may have fewer tables and columns than the original.
simplified_metadata: A MultiTableMetadata object that describes the simplified data
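Because the simplified schema may have fewer tables and columns, it can help to check what was removed before training a synthesizer. The sketch below compares two dictionaries of DataFrames with plain pandas. The helper summarize_schema_changes is hypothetical and is not part of SDV, and the stand-in dictionaries only imitate the shape of the original data and the simplified_data output.

```python
import pandas as pd


def summarize_schema_changes(data, simplified_data):
    """Hypothetical helper: list tables and columns missing from the simplified data."""
    dropped_tables = sorted(set(data) - set(simplified_data))
    dropped_columns = {}
    for name, table in simplified_data.items():
        missing = [col for col in data[name].columns if col not in table.columns]
        if missing:
            dropped_columns[name] = missing
    return dropped_tables, dropped_columns


# Stand-in inputs (assumption): in practice these would be your original data
# and the simplified_data returned by simplify_schema.
data = {
    "users": pd.DataFrame({"user_id": [1, 2], "age": [30, 41], "bio": ["a", "b"]}),
    "sessions": pd.DataFrame({"session_id": [10, 11], "user_id": [1, 2]}),
    "audit_log": pd.DataFrame({"log_id": [100], "session_id": [10]}),
}
simplified_data = {
    "users": data["users"][["user_id", "age"]],
    "sessions": data["sessions"],
}

dropped_tables, dropped_columns = summarize_schema_changes(data, simplified_data)
print(dropped_tables)
print(dropped_columns)
```

The helper assumes every table name in the simplified data also exists in the original data, which matches the description of the output above.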