❖ Auto-Detect Constraints

SDV Enterprise Bundle. This feature is available as part of the CAG Bundle, an optional add-on to SDV Enterprise. For more information, please visit the CAG Bundle page.

SDV is able to automatically identify common business rules in your dataset and add them as constraints. Use the functionality described in this guide to auto-detect, inspect, update, and add constraints to your synthesizer.

from sdv.multi_table import HSASynthesizer

# create a synthesizer
synthesizer = HSASynthesizer(metadata)

# detect constraints based on the data and add them to the synthesizer
detected_constraints = synthesizer.detect_constraints(data)
synthesizer.add_constraints(detected_constraints)

# now fit and sample the synthesizer like usual
synthesizer.fit(data)
synthetic_data = synthesizer.sample()

Detect Constraints

SDV detects constraints using your synthesizer and data.

SDV can only detect instances of predefined constraints in your data. If you have programmed your own constraints, we recommend adding them to your synthesizer before attempting any auto-detection.

<synthesizer instance>.detect_constraints

Use this method with an SDV synthesizer in order to detect constraints. All single- and multi-table SDV synthesizers are supported except for DayZSynthesizer.

Parameters:

  • (required) data: Your data. For single-table synthesizers, this is a single pd.DataFrame object. For multi-table synthesizers, your data is a dictionary that maps each table name to a pd.DataFrame object that contains the data.

  • verbose: A boolean that controls whether to print out information during the detection.

    • (default) True: Show a progress bar during detection and print out a list of all the constraints that are found.

    • False: Do not print anything out during detection.

  • constraint_classes: A list of strings describing which constraint classes to detect.

    • (default) ['OneHotEncoding', 'FixedNullCombinations', 'ChainedInequality', 'DenormalizedTable']: Detect constraints for these predefined constraint classes. Note that DenormalizedTable will not be applied to single-table synthesizers.

    • [<string>]: Provide a list of strings with the class names of the predefined constraints. Currently, SDV supports auto-detecting all predefined constraints except for ForeignToForeignKey and ReferenceTable.

  • table_names: A list of table names describing which tables to use when detecting constraints.

    • (default) None: Use all the table names in the dataset.

    • [<string>]: Provide a list of strings with the table names. If provided, SDV will only detect constraints within these tables.

  • detection_params: A dictionary describing any additional parameters to use when detecting each constraint class. This advanced functionality can be used to control the exact specifications of what should be detected.

    • (default) None: Use the default detection capabilities for each of the constraint classes.

    • <dictionary>: A dictionary that maps each constraint class name (string) to a dictionary of parameters for detection. If available, these parameters are listed in the API for each individual constraint class. (See Predefined Constraints.)

Returns: A ConstraintList object. This object is similar to a list of constraints. However, it has specialized methods for adding and removing constraints, as the order of constraints may impact the algorithm.
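As an illustration of how these parameters fit together, the sketch below wraps a call to detect_constraints that restricts detection to two constraint classes and, optionally, to a chosen set of tables. The helper name detect_selected_constraints and the table name orders are hypothetical. The keyword names come from the parameter list above, and the code assumes that you have already created a synthesizer and loaded your data.

```python
def detect_selected_constraints(synthesizer, data, table_names=None):
    """Run auto-detection quietly for a subset of constraint classes.

    Hypothetical helper: `synthesizer` is assumed to be an SDV synthesizer
    that provides detect_constraints, and `data` is the matching data
    (a DataFrame, or a dict of DataFrames for multi-table synthesizers).
    """
    return synthesizer.detect_constraints(
        data,
        verbose=False,  # no progress bar or printed summary
        constraint_classes=['OneHotEncoding', 'ChainedInequality'],
        table_names=table_names,  # None means "use every table"
    )

# Example usage (assumes `synthesizer` and `data` already exist):
# detected_constraints = detect_selected_constraints(
#     synthesizer, data, table_names=['orders'])
```

Passing table_names=None keeps the documented default of searching every table.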

Inspect Constraints

After detecting constraints, you can inspect them by printing out your ConstraintList object. Each constraint is printed next to its index, which indicates the order in which it will be applied.

Remove Constraints

You can remove any of the auto-detected constraints from your ConstraintList object.

<constraint_list_instance>.remove_constraint

Use this function to remove any of the constraints in your ConstraintList object. Keep in mind that the constraint list is designed to be applied in order. Removing a constraint from the middle of the list may mean that some of the constraints that follow it are no longer applicable.

Parameters:

  • (required) index: An integer describing the position of the constraint that you want to remove. To find the position for a constraint, print out the list.

  • redetect: Whether to re-detect constraints that occur after the one you've removed.

    • (default) False: Remove the constraint and any other constraints that come after it if they are no longer valid.

    • True: Remove the constraint and any other constraints that come after it if they are no longer valid. Then, re-detect constraints in case there are more.

  • verbose: A boolean that controls whether to print out information during removal/re-detection.

    • (default) True: Show a progress bar during deletion and re-detection and print out a list of all the constraints that are found.

    • False: Do not print anything out during removal or re-detection.

Returns: None. The ConstraintList instance will no longer have the constraint, and may have some new constraints based on re-detection.
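A minimal sketch of the inspect-then-remove step, assuming constraint_list is the ConstraintList returned by detect_constraints. The helper name drop_constraint is hypothetical, and passing the documented parameters as keyword arguments is an assumption.

```python
def drop_constraint(constraint_list, index, redetect=True):
    """Print the list, remove one constraint by position, then print the result.

    Hypothetical helper: `constraint_list` is assumed to be the ConstraintList
    returned by detect_constraints.
    """
    print(constraint_list)  # each constraint is shown next to its index
    # remove_constraint returns None and changes the list in place
    constraint_list.remove_constraint(index=index, redetect=redetect, verbose=False)
    print(constraint_list)  # later constraints may have been dropped or re-detected

# Example usage (assumes `detected_constraints` already exists):
# drop_constraint(detected_constraints, index=1)
```

Printing before and after the call makes it easy to see whether any later constraints were affected by the removal.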

Why is re-detection necessary? In SDV, each constraint might internally transform the data. For example, one constraint might merge two tables together. Then the following constraint assumes that the merged table exists, and it might use the merged table in order to perform its logic.

In this case, removing the first constraint would make the second one invalid. SDV will either remove it (if redetect=False) or try to re-detect it based on what the data now looks like (if redetect=True).
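The dependency between ordered constraints can be pictured with a small toy model. This is not SDV's implementation: ToyConstraint, remove_and_prune, and the table names are all hypothetical, and the sketch only mimics the redetect=False behavior, where later constraints that are no longer valid are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConstraint:
    """Toy stand-in for a constraint (not an SDV class)."""
    name: str
    requires: set = field(default_factory=set)  # tables the constraint reads
    produces: set = field(default_factory=set)  # tables it creates by transforming the data


def remove_and_prune(constraints, index, base_tables):
    """Remove constraints[index], then drop later constraints whose inputs no longer exist."""
    remaining = constraints[:index] + constraints[index + 1:]
    available = set(base_tables)
    kept = []
    for constraint in remaining:
        if constraint.requires <= available:
            kept.append(constraint)
            available |= constraint.produces
    return kept


constraints = [
    ToyConstraint('merge_tables', requires={'users', 'orders'}, produces={'users_orders'}),
    ToyConstraint('inequality_on_merged', requires={'users_orders'}),
    ToyConstraint('one_hot_on_users', requires={'users'}),
]

# Removing the merge leaves the second constraint without its merged table.
kept = remove_and_prune(constraints, index=0, base_tables={'users', 'orders'})
print([c.name for c in kept])  # ['one_hot_on_users']
```

With redetect=True, SDV would additionally search the data again for constraints that apply now that the earlier transformation is gone; the toy model leaves that step out.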

Add Auto-Detected Constraints to Your Synthesizer

Finally, when all the auto-detected constraints look good, you can add them to a single- or multi-table SDV synthesizer.

<synthesizer_instance>.add_constraints

Use this method to add the auto-detected constraints to your synthesizer instance. All single- and multi-table SDV synthesizers are supported except for DayZSynthesizer.

Parameters:

  • (required) constraints: A ConstraintList object containing your auto-detected constraints.

Returns: None. The constraints are now added to the synthesizer. Fit the synthesizer after adding the constraints and before sampling. When you sample synthetic data, it is guaranteed to follow the constraints.

FAQ

Does the order of the constraints matter?

Yes! When SDV auto-detects the constraints, they are meant to be applied in the same order that they are found in. This is why the ConstraintList object is an indexed, ordered list. If you had already added constraints to your synthesizer, then the auto-detected constraints are meant to be applied after the ones already added.

Order is important because each constraint might internally transform the data. For example, one constraint might merge two tables together. Then the following constraint assumes that the merged table exists, and it might use the merged table in order to perform its logic. In this case, removing the first constraint would make the second one invalid, which is also why removing a constraint is tricky.

Can I auto-detect multiple times?

Yes, you can call detect_constraints multiple times on the synthesizer with different constraints and parameters. The synthesizer detects new constraints based on the ones you've already added to the synthesizer. So be sure to add each batch of constraints before auto-detecting new ones.
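The detect-then-add cycle described above could be sketched as a small loop. The helper name detect_in_batches is hypothetical, and the batches in the usage comment are just one possible split of the default constraint classes. The code assumes an already-created synthesizer and loaded data.

```python
def detect_in_batches(synthesizer, data, batches):
    """Detect and add constraints one batch of constraint classes at a time.

    Hypothetical helper: `batches` is a list of lists of constraint class names.
    """
    for constraint_classes in batches:
        detected = synthesizer.detect_constraints(
            data,
            constraint_classes=constraint_classes,
            verbose=False,
        )
        # Add this batch before detecting the next one, so that later
        # detection takes the already-added constraints into account.
        synthesizer.add_constraints(detected)

# Example usage (assumes `synthesizer` and `data` already exist):
# detect_in_batches(
#     synthesizer,
#     data,
#     batches=[['DenormalizedTable'], ['OneHotEncoding', 'FixedNullCombinations']],
# )
```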

Why are some constraint classes not enabled by default?

By default, SDV auto-detection looks for specific predefined constraints that are commonly found in enterprise schemas. Searching through all predefined constraints can be computationally expensive, so we have found that defaulting to the most commonly appearing constraints is a good compromise.

Additionally, some of the predefined constraints completely cover others in terms of their logic. For example, the ChainedInequality constraint completely covers all logic that could be described using the Range or Inequality constraints. It would be redundant to detect all of them.

If you'd like to detect constraints for another class, you can call auto-detect again and provide the constraint classes you'd like to add. Be sure to add any existing constraints to your synthesizer first before calling auto-detect again.

What if the auto-detection missed a predefined constraint?

If auto-detection has missed a constraint, you can try calling detect_constraints again with the specific table names and constraint classes that correspond to it. Be sure to also update the detection parameters to allow for its detection.

If SDV is still unable to detect the constraint, there may be two things going on:

  1. The constraint logic may not actually hold true for all the rows of your dataset. In order to add a constraint, SDV requires that it is always true in your dataset. (For more info, see the CAG FAQ.)

  2. Alternatively, you may have already applied a constraint to your synthesizer that makes this one redundant or invalidates it in some way. For example, a constraint you have already added may delete a column, so that column is no longer available to another constraint. If this is happening, you can try defining the constraint yourself and adding it to the synthesizer. This constraint would then require reject sampling. (For more info, see the CAG FAQ.)
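To illustrate the first point, here is a toy check in plain Python showing how a single violating row stops a rule from holding for the whole dataset. It is not SDV's detection logic: the function one_hot_holds and the column names are hypothetical, and the rule is a simplified one-hot pattern (exactly one 1 per row across the chosen columns).

```python
def one_hot_holds(rows, columns):
    """Return True only if every row has exactly one 1 (and 0s elsewhere) across `columns`."""
    expected = [0] * (len(columns) - 1) + [1]
    return all(sorted(row[col] for col in columns) == expected for row in rows)


rows = [
    {'is_card': 1, 'is_cash': 0, 'is_check': 0},
    {'is_card': 0, 'is_cash': 1, 'is_check': 0},
    {'is_card': 1, 'is_cash': 1, 'is_check': 0},  # violates the rule
]
columns = ['is_card', 'is_cash', 'is_check']

print(one_hot_holds(rows[:2], columns))  # True: the rule holds for every row
print(one_hot_holds(rows, columns))      # False: a single violating row breaks it
```

If a check like this fails on your real data, the rule is not always true there, which would explain why auto-detection does not report it.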

Does the synthesizer affect which constraints are auto-detected?

Constraint auto-detection is based only on your data and metadata. The synthesizer algorithm itself (e.g. GaussianCopula, CTGAN, HSA) does not affect the outcome.

The auto-detection is done using a synthesizer because this is where it fits best within the workflow. If you don't know which synthesizer you'd like to use, or you just want to know the constraints for now, we recommend creating a synthesizer just for the purposes of constraint detection.

You can add the constraints to a different synthesizer than the one you originally used for detection, as long as it's within the same modality (single or multi-table).
