Constraint-Augmented Generation (CAG)

Do you have business rules in your dataset? These are deterministic rules that every single row in your data must follow in order to be considered valid. By default, SDV synthesizers are probabilistic so they may not learn to match your rule 100% of the time.

The good news is that you can input your business rules into your synthesizer using constraints. Our constraint-augmented generation ensures that your synthetic data meets the constraint 100% of the time.

Identify the business rules in your dataset ahead of time. SDV does not yet offer automatic constraint detection, so you must identify the rules yourself and then input them into your synthesizer as constraints.

Constraint Example

One example of a business rule is when the values in one column always have to be greater than values in another column. This is true for every single row of data.

For example, the checkout_date must be greater than the checkin_date for all rows.

You can supply this business rule to a synthesizer using an Inequality constraint.

from sdv.cag import Inequality

# create a constraint that corresponds to your business rule
my_constraint = Inequality(
    low_column_name='checkin_date',
    high_column_name='checkout_date'
)

# add the constraint to your SDV synthesizer
# (my_synthesizer is a synthesizer object you have already created)
my_synthesizer.add_constraints(constraints=[
    my_constraint
])
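As a quick sanity check, you may want to confirm that a set of rows actually follows the business rule, whether they come from your real data or from sampled synthetic data. The sketch below uses plain Python only. The helper name find_violations and the sample rows are hypothetical and are not part of the SDV API.

```python
from datetime import date


def find_violations(rows, low_column_name, high_column_name):
    """Return the indices of rows where high is not strictly greater than low."""
    return [
        index
        for index, row in enumerate(rows)
        if not row[high_column_name] > row[low_column_name]
    ]


# Made-up sample rows, for illustration only
rows = [
    {'checkin_date': date(2024, 1, 5), 'checkout_date': date(2024, 1, 8)},
    {'checkin_date': date(2024, 2, 1), 'checkout_date': date(2024, 2, 1)},
    {'checkin_date': date(2024, 3, 9), 'checkout_date': date(2024, 3, 7)},
]

violations = find_violations(rows, 'checkin_date', 'checkout_date')
print(violations)  # prints [1, 2]
```

An empty list means every row satisfies the rule.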

Predefined Constraint Classes

Used predefined constraints to apply logic within a single table or between multiple tables. Predefined constraints represent common business rules that may appear in your dataset.

Browse the predefined constraints to learn more. We also recommend going through our tutorial.

Program Your Own Constraint

If your logic cannot be described by predefined constraints, program your own constraint. The logic must be defined in a separate Python file that you can load and add to any synthesizer.

See the Program Your Own Constraint guide for more details. We also recommend going through our tutorial.

Constraints API

Create and add constraint objects using the add_constraints function. For more details, see the API Reference.

FAQs

How is modeling & sampling performance impacted by constraints?

In most cases, the time it takes to fit the model and sample synthetic data should not be significantly affected. However, there are certain scenarios where you may notice a slow-down:

  • You have a large number of constraints that overlap. That is, multiple constraints are referencing the same columns of the data.

  • You are conditional sampling on the constrained columns. This requires some special processing and it may not always be possible to efficiently create conditional synthetic data.
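To see whether the first scenario applies to you, it can help to list which columns are referenced by more than one constraint. The sketch below is a plain-Python illustration. The helper find_overlapping_columns, the constraint labels, and the column names booking_date and room_rate are all hypothetical and are not part of the SDV API.

```python
from collections import defaultdict


def find_overlapping_columns(constraint_columns):
    """Map each column used by more than one constraint to those constraints."""
    users = defaultdict(list)
    for constraint_name, columns in constraint_columns.items():
        for column in columns:
            users[column].append(constraint_name)
    return {column: names for column, names in users.items() if len(names) > 1}


# Hypothetical bookkeeping: a label for each constraint and the columns it uses
constraint_columns = {
    'stay_dates': ['checkin_date', 'checkout_date'],
    'booking_before_checkin': ['booking_date', 'checkin_date'],
    'positive_price': ['room_rate'],
}

overlaps = find_overlapping_columns(constraint_columns)
print(overlaps)  # {'checkin_date': ['stay_dates', 'booking_before_checkin']}
```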

For any questions or feature requests related to performance, please create an issue describing your data, constraints and sampling needs.

How does the SDV handle the constraints?

Under the hood, SDV uses a combination of strategies to ensure that the synthetic data always follows the constraints. These strategies are:

  • Transformation: Most of the time, it's possible to transform the data in a way that guarantees the models will be able to learn the constraint. This is paired with a reverse transformation to ensure the synthetic data looks like the original.

  • Reject Sampling: Another strategy is to model and sample synthetic data as usual, and then throw away any rows in the synthetic data that violate the constraints.

  • Algorithmic Injection: Complex CAG patterns sometimes come with their own algorithms for ensuring robust and accurate modeling. These algorithms may learn additional patterns from your data. They are compatible with any SDV synthesizer.
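The first two strategies can be illustrated with a toy sketch for an inequality rule on numeric columns. This is a conceptual illustration only and does not show SDV's internal implementation. All function names, and the numeric checkin_day and checkout_day columns, are assumptions made for the example.

```python
import math
import random


def transform(low, high):
    """Replace `high` with log(high - low), a value with no constraint on it."""
    return low, math.log(high - low)


def reverse_transform(low, log_gap):
    """Rebuild `high`. exp() is positive, so high > low (up to float precision)."""
    return low, low + math.exp(log_gap)


def reject_sample(sampler, is_valid, num_rows, max_tries=10_000):
    """Draw rows from `sampler`, keeping only those that pass `is_valid`."""
    kept = []
    for _ in range(max_tries):
        if len(kept) == num_rows:
            break
        row = sampler()
        if is_valid(row):
            kept.append(row)
    return kept


rng = random.Random(0)


def naive_sampler():
    """Stand-in for a model that knows nothing about the rule."""
    return {
        'checkin_day': rng.uniform(0, 30),
        'checkout_day': rng.uniform(0, 30),
    }


def follows_rule(row):
    return row['checkout_day'] > row['checkin_day']


rows = reject_sample(naive_sampler, follows_rule, num_rows=5)
```

In the transformation approach, a model only ever sees the unconstrained log gap, so any value it generates maps back to a valid row. In the reject sampling approach, invalid rows are simply discarded, which can become slow when few sampled rows pass the check.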
