Constraint Logic

Do you have rules that every row in the data must follow? Are these the same regardless of how much data there is? You can use constraints to describe this business logic in your metadata.

Predefined Constraint Classes

The SDV has 9 predefined constraint classes that cover rules commonly found in enterprise data. For example, when the value in one column must always be greater than the value in another column, use the Inequality constraint.

my_constraint = {
    'constraint_class': 'Inequality',
    'table_name': 'guests', # for multi table synthesizers
    'constraint_parameters': {
        'low_column_name': 'checkin_date',
        'high_column_name': 'checkout_date',
        'strict_boundaries': True
    }
}

my_synthesizer.add_constraints(constraints=[
    my_constraint
])

Browse the predefined constraints to learn more.
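To make the rule concrete, here is a minimal plain-Python sketch of the row-level check that the Inequality example describes. It is not SDV code: the helper satisfies_inequality and the sample rows are hypothetical, and the sketch assumes that strict_boundaries=True means the low value must be strictly less than the high value, while False also allows equal values.

```python
from datetime import date


def satisfies_inequality(row, low_column_name, high_column_name,
                         strict_boundaries=False):
    """Return True if the row obeys low < high (strict) or low <= high."""
    low = row[low_column_name]
    high = row[high_column_name]
    if strict_boundaries:
        return low < high
    return low <= high


# Hypothetical rows, for illustration only.
guests = [
    {'checkin_date': date(2023, 3, 1), 'checkout_date': date(2023, 3, 4)},
    # Same-day checkout: valid only when boundaries are not strict.
    {'checkin_date': date(2023, 3, 5), 'checkout_date': date(2023, 3, 5)},
    # Checkout before checkin: never valid.
    {'checkin_date': date(2023, 3, 9), 'checkout_date': date(2023, 3, 7)},
]

strict_results = [
    satisfies_inequality(row, 'checkin_date', 'checkout_date', True)
    for row in guests
]
loose_results = [
    satisfies_inequality(row, 'checkin_date', 'checkout_date', False)
    for row in guests
]
```

With strict boundaries, only the first row passes. Without them, the same-day row passes as well.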

Custom Business Logic

If your dataset includes business logic that cannot be covered by the predefined constraints, then you can create your own custom constraint. The logic must be defined in a separate Python file that you can load.

my_synthesizer.load_custom_constraint_classes(
    filepath='custom_constraint_template.py',
    class_names=['MyCustomConstraintClass']
)

Then, you can create a custom constraint just like a predefined constraint.

my_custom_constraint = {
    'constraint_class': 'MyCustomConstraintClass',
    'table_name': 'guests', # for multi table synthesizers
    'constraint_parameters': {
        'column_names': ['column_A', 'column_B'],
        'extra_parameter': 10.00
    }
}

my_synthesizer.add_constraints(constraints=[
    my_custom_constraint
])

See the Custom Business Logic guide for more details.

FAQs

Do you need constraints?

Before adding a constraint to your model, carefully consider whether it is necessary. Here are a few questions to ask:

  • How do I plan to use the synthetic data? Even without the constraint, the synthetic data may still follow the rule the majority of the time. Only add the constraint if you require 100% adherence.

  • Who do I plan to share the synthetic data with? Consider whether they will be able to use the business rule to uncover sensitive information about the real data.

  • How did the rule come to be? In some cases, another data source may be available that does not include the extra columns and rules.

In the ideal case, you apply only a handful of constraints to your model.
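One way to answer the first question is to sample from a synthesizer that has no constraints and measure how often the rule already holds. The sketch below is plain Python and does not use the SDV: adherence_rate is a hypothetical helper, and the rows are made-up stand-ins for an unconstrained synthetic sample.

```python
def adherence_rate(rows, rule):
    """Return the fraction of rows for which rule(row) is True."""
    if not rows:
        raise ValueError('Cannot measure adherence on an empty sample.')
    passed = sum(1 for row in rows if rule(row))
    return passed / len(rows)


# Made-up rows standing in for data sampled without any constraint.
unconstrained_sample = [
    {'checkin_day': 1, 'checkout_day': 4},
    {'checkin_day': 10, 'checkout_day': 12},
    {'checkin_day': 20, 'checkout_day': 19},  # violates the rule
    {'checkin_day': 25, 'checkout_day': 28},
]

rate = adherence_rate(
    unconstrained_sample,
    lambda row: row['checkin_day'] < row['checkout_day'],
)
```

If the measured rate is already acceptable for your use case, the constraint may not be worth adding.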

How is modeling & sampling performance impacted by constraints?

In most cases, the time it takes to fit the model and sample synthetic data should not be significantly affected. However, there are certain scenarios where you may notice a slow-down:

  • You have a large number of constraints that overlap. That is, multiple constraints are referencing the same columns of the data.

  • Your categorical data has a high cardinality. For example, you have a categorical column with hundreds of possible categories that you are using in a FixedCombinations constraint.

  • You are conditional sampling on the constrained columns. This requires some special processing and it may not always be possible to efficiently create conditional synthetic data.
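To spot the first scenario before fitting, you could scan your constraint dictionaries for columns that appear in more than one constraint. The helper below is a hypothetical sketch, not part of the SDV. It assumes a single table and relies on a naming heuristic: parameters whose key ends in column_name hold one column, and parameters whose key ends in column_names hold a list of columns. The amenities_fee column is invented for the example.

```python
from collections import defaultdict


def find_overlapping_columns(constraints):
    """Map each column used by 2+ constraints to the classes that use it."""
    usage = defaultdict(list)
    for constraint in constraints:
        columns = set()
        parameters = constraint.get('constraint_parameters', {})
        for key, value in parameters.items():
            if key.endswith('column_name') and isinstance(value, str):
                columns.add(value)
            elif key.endswith('column_names') and isinstance(value, (list, tuple)):
                columns.update(value)
        for column in columns:
            usage[column].append(constraint['constraint_class'])
    return {col: classes for col, classes in usage.items() if len(classes) > 1}


constraints = [
    {
        'constraint_class': 'Inequality',
        'constraint_parameters': {
            'low_column_name': 'checkin_date',
            'high_column_name': 'checkout_date',
            'strict_boundaries': True,
        },
    },
    {
        'constraint_class': 'MyCustomConstraintClass',
        'constraint_parameters': {
            'column_names': ['checkout_date', 'amenities_fee'],
            'extra_parameter': 10.00,
        },
    },
]

overlaps = find_overlapping_columns(constraints)
```

Here checkout_date is referenced by both constraints, so it is the only column reported.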

For any questions or feature requests related to performance, please create an issue describing your data, constraints and sampling needs.

How does the SDV handle the constraints?

Under the hood, the SDV uses a combination of strategies to ensure that the synthetic data always follows the constraints. These strategies are:

  • Transformation: Most of the time, it's possible to transform the data in a way that guarantees the models will be able to learn the constraint. This is paired with a reverse transformation to ensure the synthetic data looks like the original.

  • Reject Sampling: Another strategy is to model and sample synthetic data as usual, and then throw away any rows in the synthetic data that violate the constraints.
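As an illustration of the transformation idea, consider a rule that one value must be greater than another. One possible approach, shown here as a plain-Python sketch and not necessarily what the SDV does internally, is to replace the high value with the logarithm of the gap between the two values. Any real number a model produces for that gap maps back to a valid pair. The function names are hypothetical.

```python
import math


def transform(low, high):
    """Replace `high` with log(high - low). Requires high > low."""
    if high <= low:
        raise ValueError('The input pair must already satisfy low < high.')
    return low, math.log(high - low)


def reverse_transform(low, log_gap):
    """Rebuild `high` from the log gap. exp() is positive, so high > low."""
    return low, low + math.exp(log_gap)


# Round trip on a valid pair recovers the original values.
low, log_gap = transform(2.0, 5.0)
restored_low, restored_high = reverse_transform(low, log_gap)

# Moderate values a model might generate for the log gap all yield valid pairs.
generated_pairs = [reverse_transform(10.0, gap) for gap in (-3.0, 0.0, 2.5)]
```

The model only ever sees the low value and the log gap, so it does not need to learn the rule. The reverse transformation enforces it.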

Transformation is the most efficient strategy, but it is not always possible to use. For example, multiple constraints might be attempting to transform the same column, or the constraint logic itself may not be possible to achieve through a transformation. In such cases, the SDV will fall back to using reject sampling.

Reject sampling may slow down the sampling process, but the synthetic data is still guaranteed to meet the constraints.
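The slow-down comes from discarded rows: every rejected row costs a sampling step without adding to the output. The following is a minimal plain-Python sketch of the idea, not the SDV's implementation. The sample_row function is a stand-in for a fitted model, and sample_with_rejection and max_tries are hypothetical names.

```python
import random


def sample_with_rejection(sample_row, is_valid, num_rows, max_tries=10_000):
    """Draw rows until num_rows valid ones are collected.

    Returns the accepted rows and the total number of rows drawn.
    """
    accepted = []
    tries = 0
    while len(accepted) < num_rows:
        if tries >= max_tries:
            raise RuntimeError('Too many rejected rows; giving up.')
        row = sample_row()
        tries += 1
        if is_valid(row):
            accepted.append(row)
    return accepted, tries


rng = random.Random(0)  # seeded so the run is repeatable


def sample_row():
    # Stand-in for a model that knows nothing about the rule.
    return {
        'checkin_day': rng.randint(1, 30),
        'checkout_day': rng.randint(1, 30),
    }


rows, tries = sample_with_rejection(
    sample_row,
    is_valid=lambda row: row['checkin_day'] < row['checkout_day'],
    num_rows=100,
)
```

Every returned row satisfies the rule, and the gap between tries and the number of accepted rows shows how much extra sampling the rule required.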

Copyright (c) 2023, DataCebo, Inc.