
Constraint Logic

Do you have rules that every row in the data must follow? Are these the same regardless of how much data there is? You can use constraints to describe this business logic in your metadata.

Predefined Constraint Classes

The SDV has 9 predefined constraint classes that cover rules commonly seen in enterprise data. For example, when the value in one column must always be greater than the value in another, use the Inequality constraint.
my_constraint = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',  # for multi table synthesizers
    'constraint_parameters': {
        'low_column_name': 'checkin_date',
        'high_column_name': 'checkout_date',
        'strict_boundaries': True
    }
}

my_synthesizer.add_constraints(constraints=[
    my_constraint
])

Custom Business Logic

If your dataset includes business logic that cannot be covered by the predefined constraints, then you can create your own custom constraint. The logic must be defined in a separate Python file that you can load.
my_synthesizer.load_custom_constraint_classes(
    filepath='custom_constraint_template.py',
    class_names=['MyCustomConstraintClass']
)
Then, you can create a custom constraint just like a predefined constraint.
my_custom_constraint = {
    'constraint_class': 'MyCustomConstraintClass',
    'table_name': 'guests',  # for multi table synthesizers
    'constraint_parameters': {
        'column_names': ['column_A', 'column_B'],
        'extra_parameter': 10.00
    }
}

my_synthesizer.add_constraints(constraints=[
    my_custom_constraint
])
See the Custom Business Logic guide for more details.

FAQs

Constraints may slow down the synthetic data model and may leak private information. Before adding a constraint to your model, carefully consider whether it is necessary. Here are a few questions to ask:
  • How do I plan to use the synthetic data? Without the constraint, the rule may still be valid a majority of the time. Only add the constraint if you require 100% adherence.
  • Who do I plan to share the synthetic data with? Consider whether they will be able to use the business rule to uncover sensitive information about the real data.
  • How did the rule come to be? In some cases, other data sources may be available that do not include the extra columns and rules.
In the ideal case, you are applying only a handful of constraints to your model.
Do constraints affect the modeling & sampling performance?
In most cases, the time it takes to fit the model and sample synthetic data should not be significantly affected. However, there are certain scenarios where you may notice a slow-down:
  • You have a large number of constraints that overlap. That is, multiple constraints are referencing the same columns of the data.
  • Your categorical data has a high cardinality. For example, you have a categorical column with hundreds of possible categories that you are using in a FixedCombinations constraint.
  • You are conditional sampling on the constrained columns. This requires some special processing and it may not always be possible to efficiently create conditional synthetic data.
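As a rough way to spot the first scenario, the sketch below scans a list of constraint dictionaries and reports columns that more than one constraint references. It is a standalone helper, not part of the SDV. The function names are hypothetical, and it assumes that parameters ending in `column_name` hold a single column while parameters ending in `column_names` hold a list, as in the examples on this page.

```python
from collections import defaultdict


def columns_used(constraint):
    """Collect the column names referenced by one constraint dictionary.

    Assumption: keys ending in 'column_names' hold a list of columns and
    keys ending in 'column_name' hold a single column.
    """
    columns = []
    for key, value in constraint.get('constraint_parameters', {}).items():
        if key.endswith('column_names'):
            columns.extend(value)
        elif key.endswith('column_name'):
            columns.append(value)
    return columns


def find_overlaps(constraints):
    """Map each (table, column) used by two or more constraints to their classes."""
    usage = defaultdict(list)
    for constraint in constraints:
        table = constraint.get('table_name')
        for column in columns_used(constraint):
            usage[(table, column)].append(constraint['constraint_class'])
    return {key: classes for key, classes in usage.items() if len(classes) > 1}


# Two illustrative constraints that both touch 'checkout_date'.
constraints = [
    {
        'constraint_class': 'Inequality',
        'table_name': 'guests',
        'constraint_parameters': {
            'low_column_name': 'checkin_date',
            'high_column_name': 'checkout_date',
            'strict_boundaries': True,
        },
    },
    {
        'constraint_class': 'MyCustomConstraintClass',
        'table_name': 'guests',
        'constraint_parameters': {
            'column_names': ['checkout_date', 'column_B'],
            'extra_parameter': 10.00,
        },
    },
]

overlaps = find_overlaps(constraints)
print(overlaps)
```

An empty result means no column is shared between constraints, at least under the naming assumption above.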
For any questions or feature requests related to performance, please create an issue describing your data, constraints and sampling needs.
Why am I getting a ConstraintsNotMetError when I try to fit my data?
A constraint should describe a rule that is true for every row in your real data. If any rows in the real data violate the rule, the SDV will raise a ConstraintsNotMetError. Since the constraint is not true in your real data, the model will not be able to learn it.
If you see this error, you have two options:
  • (recommended) Remove the constraint. This ensures the model learns patterns that exist in the real data. You can always use conditional sampling later to generate synthetic data with specific values.
  • Clean your input dataset. If you remove the violating rows from the real data, then you will be able to apply the constraint to the cleaned data. This is not recommended because even if the synthetic data follows the rule, the model is not truly representative of the original data.
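Before choosing between these options, it can help to see which rows break the rule. The sketch below checks the check-in/check-out rule from the Inequality example in plain Python. The sample rows and the `find_violations` helper are made up for illustration, and the sketch assumes that strict boundaries mean the high value must be strictly greater than the low value.

```python
from datetime import date

# Illustrative rows; only the first one satisfies the strict rule.
rows = [
    {'guest': 'a', 'checkin_date': date(2023, 1, 1), 'checkout_date': date(2023, 1, 3)},
    {'guest': 'b', 'checkin_date': date(2023, 2, 10), 'checkout_date': date(2023, 2, 10)},
    {'guest': 'c', 'checkin_date': date(2023, 3, 5), 'checkout_date': date(2023, 3, 4)},
]


def find_violations(rows, low, high, strict=True):
    """Return the indices of rows where low < high (or low <= high) does not hold."""
    violations = []
    for index, row in enumerate(rows):
        is_valid = row[low] < row[high] if strict else row[low] <= row[high]
        if not is_valid:
            violations.append(index)
    return violations


violations = find_violations(rows, 'checkin_date', 'checkout_date', strict=True)
print(violations)
```

If the list is short, inspecting those rows often shows whether the rule is wrong or the data is.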
How does the SDV handle the constraints?
Under the hood, the SDV uses a combination of strategies to ensure that the synthetic data always follows the constraints. These strategies are:
  • Transformation: Most of the time, it's possible to transform the data in a way that guarantees the models will be able to learn the constraint. This is paired with a reverse transformation to ensure the synthetic data looks like the original.
  • Reject Sampling: Another strategy is to model and sample synthetic data as usual, and then throw away any rows in the synthetic data that violate the constraints.
Transformation is the more efficient strategy, but it is not always possible to use. For example, multiple constraints might be attempting to transform the same column, or the constraint logic itself may not be achievable through a transformation. In such cases, the SDV will fall back to using reject sampling.
You'll get a warning when this happens. Reject sampling may slow down the sampling process, but the synthetic data is still guaranteed to meet the constraint.
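To make the two strategies concrete, here is a toy sketch for a "low must be less than high" rule. It does not reproduce the SDV's internal code: the log-of-the-gap transformation, the uniform and Gaussian samplers standing in for a fitted model, and all function names are assumptions chosen for illustration.

```python
import math
import random


def transform(low, high):
    """Replace `high` with the log of the gap, which can take any real value."""
    return low, math.log(high - low)


def reverse_transform(low, log_gap):
    """Rebuild `high`; exp() is always positive, so high > low by construction."""
    return low, low + math.exp(log_gap)


def reject_sample(sample_row, is_valid, num_rows, max_tries=10_000):
    """Draw rows from `sample_row` and keep only those that pass `is_valid`."""
    kept = []
    tries = 0
    while len(kept) < num_rows and tries < max_tries:
        row = sample_row()
        tries += 1
        if is_valid(row):
            kept.append(row)
    return kept


rng = random.Random(0)

# Transformation: every sampled log-gap maps back to a valid (low, high) pair.
transformed = [
    reverse_transform(rng.uniform(0, 10), rng.gauss(0, 1)) for _ in range(100)
]

# Reject sampling: sample both values freely, then discard the invalid rows.
rejected = reject_sample(
    sample_row=lambda: (rng.uniform(0, 10), rng.uniform(0, 10)),
    is_valid=lambda row: row[0] < row[1],
    num_rows=100,
)
```

In this toy setup the transformation never wastes a draw, while reject sampling discards roughly half of them, which is why it tends to be the slower of the two.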
Copyright (c) 2023, DataCebo, Inc.