Cleaning Your Data
Use the utility functions below to clean your multi-table data for fast and effective multi-table modeling.
drop_unknown_references
Multi-table SDV synthesizers work best when your dataset has referential integrity, meaning that all the references in a foreign key refer to an existing value in the primary key. Use this function to drop rows that contain unknown references for your synthesizer.
Parameters
(required) metadata: A MultiTableMetadata object
(required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.
drop_missing_values: A boolean that describes whether to drop missing values in the foreign key
(default) True: If a foreign key has a missing value, treat it as an unknown reference and drop it. We recommend this setting for maximum efficiency with SDV.
False: If a foreign key has a missing value, treat it as a valid reference and keep it
verbose: A boolean that controls whether to print out a summary of the results
(default) True: Print a summary of the number of rows that are dropped from each table
Output: A dictionary that maps each table name to a pandas DataFrame containing data. The data will have referential integrity, meaning that there will be no unknown foreign key references.
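To make the idea of an unknown reference concrete, here is a minimal conceptual sketch of the check for a single parent/child relationship. It is not SDV's implementation: plain lists of dictionaries stand in for pandas DataFrames so the snippet runs without dependencies, and the helper name, table names, and column names are all hypothetical.

```python
# Conceptual sketch only: plain lists of dicts stand in for pandas DataFrames,
# and every name below (helper, tables, columns) is hypothetical.

def drop_rows_with_unknown_refs(parent_rows, child_rows, primary_key, foreign_key,
                                drop_missing_values=True):
    """Return the child rows whose foreign key matches an existing primary key."""
    known_keys = {row[primary_key] for row in parent_rows}
    kept = []
    for row in child_rows:
        value = row[foreign_key]
        if value is None:
            # A missing foreign key is kept only when drop_missing_values is False.
            if not drop_missing_values:
                kept.append(row)
        elif value in known_keys:
            kept.append(row)
    return kept


users = [{"user_id": 1}, {"user_id": 2}]
sessions = [
    {"session_id": "a", "user_id": 1},
    {"session_id": "b", "user_id": 99},    # unknown reference
    {"session_id": "c", "user_id": None},  # missing reference
]

strict = drop_rows_with_unknown_refs(users, sessions, "user_id", "user_id")
lenient = drop_rows_with_unknown_refs(users, sessions, "user_id", "user_id",
                                      drop_missing_values=False)

print([row["session_id"] for row in strict])   # ['a']
print([row["session_id"] for row in lenient])  # ['a', 'c']
```

The sketch handles one relationship only. The function described above works from the metadata, so it covers every relationship in the dataset.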
simplify_schema
Some synthesizers, like HMA, are not designed to work with a large number of tables. This function reduces your schema to a minimal set by dropping some tables and columns, which allows you to complete your proof-of-concept using public SDV.
After completing your proof-of-concept, you can reach out to us to inquire about our paid SDV plans. SDV Enterprise supports many more tables, so you will not have to use simplify_schema on the paid plan.
Parameters
(required) data: A dictionary that maps each table name to a pandas DataFrame containing data. This data should match your metadata.
(required) metadata: A MultiTableMetadata object that describes your data.
Output:
simplified_data: A dictionary that maps a table name to a pandas DataFrame containing the data. The simplified data schema may have fewer tables and columns than the original.
simplified_metadata: A MultiTableMetadata object that describes the simplified data
get_random_subset
Use this function to subsample data from your dataset. This function will keep the same overall schema (and columns) but it will reduce the number of rows in each table.
Parameters
(required) data: Your full dataset. A dictionary that maps a table name to a pandas DataFrame containing the data
(required) metadata: A MultiTableMetadata object that describes the data
(required) main_table_name: A string with the name of the most important table in your dataset. We'll make sure the subsample is optimized for this main table.
(required) num_rows: The number of rows to subsample from the main table. All other tables' sizes will be algorithmically determined based on this.
verbose: Whether to print out the results
(default) True: Print out how many rows were included from each table
False: Do not print anything out
Output: A dataset with fewer rows than before. The dataset will continue to have referential integrity, meaning that there will be no invalid or missing references between the tables.
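The page does not describe how the sizes of the other tables are determined, so the following is only one simple way a subsample can keep referential integrity: sample the main table first, then keep the child rows that point at the sampled rows. It is a sketch under stated assumptions, not SDV's algorithm. Plain lists of dictionaries stand in for pandas DataFrames, and the helper name, the `seed` parameter, and the table and column names are hypothetical.

```python
import random

# Conceptual sketch only: not SDV's algorithm. Plain lists of dicts stand in
# for pandas DataFrames, and every name below is hypothetical.

def subsample_parent_and_child(parent_rows, child_rows, primary_key, foreign_key,
                               num_rows, seed=0):
    """Sample num_rows parent rows, then keep only the child rows that reference them."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sampled_parents = rng.sample(parent_rows, num_rows)
    kept_keys = {row[primary_key] for row in sampled_parents}
    sampled_children = [row for row in child_rows if row[foreign_key] in kept_keys]
    return sampled_parents, sampled_children


# Ten users, each referenced by exactly three sessions.
users = [{"user_id": i} for i in range(10)]
sessions = [{"session_id": i, "user_id": i % 10} for i in range(30)]

small_users, small_sessions = subsample_parent_and_child(
    users, sessions, "user_id", "user_id", num_rows=3
)

print(len(small_users), len(small_sessions))
```

Because the child rows are filtered by the sampled primary keys, every remaining foreign key still refers to an existing row, which is the property the output above guarantees.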