Synthetic Data Vault
Copyright (c) 2023, DataCebo, Inc.

Constraint-Augmented Generation (CAG)


Last updated 1 day ago

Do you have business rules in your dataset? These are deterministic rules that every single row in your data must follow in order to be considered valid. By default, SDV synthesizers are probabilistic, so they may not learn to match your rule 100% of the time.

The good news is that you can input your business rules into your synthesizer using constraints. Our constraint-augmented generation ensures that your synthetic data meets the constraint 100% of the time.

Constraint Example

One example of a business rule is when the values in one column always have to be greater than values in another column. This is true for every single row of data.

from sdv.cag import Inequality

my_constraint = Inequality(
    low_column_name='checkin_date',
    high_column_name='checkout_date'
)

my_synthesizer.add_constraints(constraints=[
    my_constraint
])
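
To make the rule concrete, here is a minimal sketch of a row-by-row check for the same business rule, written in plain Python and independent of SDV. The helper name find_violations and the sample rows are hypothetical and only for illustration. The sketch treats equal dates as a violation, following the "greater than" wording of the rule; how the library's Inequality class handles ties is not stated on this page.

```python
from datetime import date


def find_violations(rows, low_column_name, high_column_name):
    """Return the indexes of rows where the low value is not strictly below the high value."""
    violations = []
    for index, row in enumerate(rows):
        low, high = row[low_column_name], row[high_column_name]
        # Rows with a missing value are skipped; this is a choice made by this sketch.
        if low is None or high is None:
            continue
        if not low < high:
            violations.append(index)
    return violations


# Hypothetical example rows (not taken from any SDV demo dataset).
rows = [
    {'checkin_date': date(2024, 3, 1), 'checkout_date': date(2024, 3, 4)},
    {'checkin_date': date(2024, 3, 9), 'checkout_date': date(2024, 3, 7)},    # checkout before checkin
    {'checkin_date': date(2024, 3, 10), 'checkout_date': date(2024, 3, 10)},  # equal dates
]

violations = find_violations(rows, 'checkin_date', 'checkout_date')
print(violations)
```

A check like this can be run on real data before adding the constraint, or on synthetic data afterwards to confirm that no rows break the rule.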

Predefined Constraint Classes

Use predefined constraints to apply logic within a single table or between multiple tables. Predefined constraints represent common business rules that may appear in your dataset.

Program Your Own Constraint

If your logic cannot be described by predefined constraints, program your own constraint. The logic must be defined in a separate Python file that you can load and add to any synthesizer.

Constraints API

FAQs

How is modeling & sampling performance impacted by constraints?

In most cases, the time it takes to fit the model and sample synthetic data should not be significantly affected. However, there are certain scenarios where you may notice a slow-down:

  • You have a large number of constraints that overlap. That is, multiple constraints are referencing the same columns of the data.

  • You are conditional sampling on the constrained columns. This requires some special processing and it may not always be possible to efficiently create conditional synthetic data.
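
One way to spot the first scenario is to list which columns are referenced by more than one constraint. The following is a small sketch in plain Python; it does not inspect SDV constraint objects. The function find_overlapping_columns, the constraint labels and the column names are all hypothetical, and the mapping from constraint to columns is assumed to be written out by hand.

```python
from collections import defaultdict


def find_overlapping_columns(constraint_columns):
    """Map each column used by two or more constraints to the names of those constraints."""
    usage = defaultdict(list)
    for constraint_name, columns in constraint_columns.items():
        for column in columns:
            usage[column].append(constraint_name)
    return {column: names for column, names in usage.items() if len(names) > 1}


# Hypothetical constraints, described only by the columns they reference.
constraint_columns = {
    'stay_dates': ['checkin_date', 'checkout_date'],
    'booking_before_checkin': ['booking_date', 'checkin_date'],
    'room_rate_range': ['room_rate'],
}

overlaps = find_overlapping_columns(constraint_columns)
print(overlaps)
```

Here only checkin_date is shared, by the stay_dates and booking_before_checkin entries. A long list of shared columns suggests the overlap scenario described above.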

How does the SDV handle the constraints?

Under the hood, the SDV uses a combination of strategies to ensure that the synthetic data always follows the constraints. These strategies are:

  • Transformation: Most of the time, it's possible to transform the data in a way that guarantees the models will be able to learn the constraint. This is paired with a reverse transformation to ensure the synthetic data looks like the original.

  • Reject Sampling: Another strategy is to model and sample synthetic data as usual, and then throw away any rows in the synthetic data that violate the constraints.

  • Algorithmic Injection: Complex CAG patterns sometimes come with their own algorithms for ensuring robust and accurate modeling. These algorithms are compatible with any SDV synthesizer.
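
The first two strategies can be illustrated with a toy example for a "low < high" rule on numeric columns. This is a sketch under stated assumptions, not the SDV's internal implementation: the log-gap transformation, the function names and the random stand-in for a model are all chosen here for illustration.

```python
import math
import random


def transform(low, high):
    """Replace `high` with log(high - low), a value with no constraint on it."""
    return low, math.log(high - low)


def reverse_transform(low, log_gap):
    """Rebuild `high`. exp() is positive, so high > low (up to floating-point rounding)."""
    return low, low + math.exp(log_gap)


def reject_sample(sample_row, is_valid, num_rows, max_tries=10_000):
    """Draw rows from `sample_row` and keep only the ones that pass `is_valid`."""
    kept = []
    for _ in range(max_tries):
        if len(kept) == num_rows:
            break
        row = sample_row()
        if is_valid(row):
            kept.append(row)
    return kept


rng = random.Random(0)  # stand-in for a fitted model; seeded for repeatability

# Transformation: whatever value is drawn for log_gap, the reversed row is valid.
transformed_rows = [
    reverse_transform(rng.uniform(0, 100), rng.gauss(0, 1)) for _ in range(1000)
]

# Reject sampling: sample without the rule, then throw away invalid rows.
rejected_rows = reject_sample(
    sample_row=lambda: (rng.uniform(0, 100), rng.uniform(0, 100)),
    is_valid=lambda row: row[0] < row[1],
    num_rows=1000,
)

print(all(low < high for low, high in transformed_rows))
print(all(low < high for low, high in rejected_rows))
```

The transformation route never wastes a sample, because validity is built into the representation. Reject sampling works for any rule that can be checked on a row, but it discards work, and it may return fewer rows than requested if valid rows are rare within max_tries.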

Constraint Example: You can supply this business rule to a synthesizer using an Inequality constraint. In this business rule, the checkout_date must be greater than the checkin_date for all rows.

Predefined Constraint Classes: Browse the predefined constraints to learn more. We also recommend going through our tutorial.

Program Your Own Constraint: See the Program Your Own Constraint guide for more details. We also recommend going through our tutorial.

Constraints API: Create and add constraint objects using the add_constraints function. For more details, see the API Reference.

Performance: For any questions or feature requests related to performance, please create an issue describing your data, constraints and sampling needs.