Synthetic Data Vault

Copyright (c) 2023, DataCebo, Inc.


❖ SelfReferentialHierarchy


Last updated 2 days ago

Use the SelfReferentialHierarchy constraint when a column in your table references the primary key column of the same table (a self-reference) and the self-references are not allowed to form any cycles.
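
To make the "no cycles" requirement concrete, here is a minimal sketch in plain Python that checks a set of self-references for cycles. It does not use SDV, and it is not how the constraint is implemented. The helper name find_cycle and the sample IDs are hypothetical, and the table is assumed to be reduced to a mapping from each primary key to its foreign key value (None for a row with no parent).

```python
def find_cycle(manager_of):
    """Return the IDs that form a cycle, or None if the references are acyclic.

    manager_of maps each primary key to its foreign key value.
    None (or a key missing from the mapping) is treated as the top of a chain.
    This walks every chain separately, so it is simple rather than fast.
    """
    for start in manager_of:
        seen = []
        current = start
        while current is not None:
            if current in seen:
                # We came back to an ID already on this chain: a cycle.
                return seen[seen.index(current):]
            seen.append(current)
            current = manager_of.get(current)
    return None


# Hypothetical data: E1 is the top of the hierarchy.
valid = {'E1': None, 'E2': 'E1', 'E3': 'E1', 'E4': 'E2'}
# Hypothetical data: A reports to B and B reports to A.
invalid = {'A': 'B', 'B': 'A', 'C': None}

print(find_cycle(valid))    # None
print(find_cycle(invalid))  # ['A', 'B']
```

Data like the second mapping would not fit this constraint, because following the references from A never reaches a row without a parent.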

Constraint API

This functionality is in Beta. At this time, select SDV Enterprise users have been invited to use this feature.

Create a SelfReferentialHierarchy constraint.

Parameters:

  • (required) table_name: A string with the name of the table that contains the self-reference

  • (required) primary_key: A string with the name of the primary key column in the table

  • (required) foreign_key: A string with the name of the foreign key column in the same table that references the primary key

  • scaling_method: A string with the name of the method to use when scaling up the synthetic data

    • (default) 'branch': Keep the original depth of the hierarchy but add more branches to it. In the employee/manager example, this would add more reports for a given manager.

    • 'depth': Add to the depth of the self-references. In the employee/manager example, this would create new levels of managers, increasing the length of the reporting chain to the CEO.

    • 'multiply': Keep the original branching factor and depth of the hierarchy, but create more trees. In the employee/manager example, this would create additional companies with new CEOs and reporting chains.
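
The three scaling methods each change a different property of the hierarchy: the number of trees, the depth, or the branching factor. The sketch below measures those properties on a small example so you can compare real and synthetic data. It is plain Python, not part of SDV. The helper name hierarchy_stats and the sample IDs are hypothetical, and the input is assumed to be a cycle-free mapping from each primary key to its foreign key value (None for a root).

```python
def hierarchy_stats(manager_of):
    """Summarize a cycle-free hierarchy given as {primary key: foreign key or None}."""
    children = {}
    roots = []
    for node, parent in manager_of.items():
        if parent is None:
            roots.append(node)
        else:
            children.setdefault(parent, []).append(node)

    def depth(node):
        # Number of nodes on the longest chain from this node down to a leaf.
        return 1 + max((depth(child) for child in children.get(node, [])), default=0)

    fanouts = [len(kids) for kids in children.values()]
    return {
        'trees': len(roots),
        'max_depth': max((depth(root) for root in roots), default=0),
        # Average number of direct children among nodes that have any.
        'avg_branching': sum(fanouts) / len(fanouts) if fanouts else 0.0,
    }


# Hypothetical company: one CEO, two managers, two reports per manager.
employees = {
    'CEO': None,
    'M1': 'CEO', 'M2': 'CEO',
    'E1': 'M1', 'E2': 'M1',
    'E3': 'M2', 'E4': 'M2',
}

print(hierarchy_stats(employees))
# {'trees': 1, 'max_depth': 3, 'avg_branching': 2.0}
```

Based on the descriptions above, you would expect 'branch' to raise avg_branching while max_depth stays the same, 'depth' to raise max_depth, and 'multiply' to raise trees while the other two stay close to the original.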

from sdv.cag import SelfReferentialHierarchy

my_constraint = SelfReferentialHierarchy(
    table_name='Employees',
    primary_key='Employee ID',
    foreign_key='Manager ID')

Make sure that the table and columns you provide are in your Metadata, and that the table has a primary key. Note that you cannot currently supply a self-reference relationship in the metadata, so the relationships section of your Metadata can be blank.

Usage

Apply the constraint to any SDV synthesizer. Then fit and sample as usual.

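from sdv.multi_table import HSASynthesizer

# metadata and data describe your tables, including 'Employees'
synthesizer = HSASynthesizer(metadata)
synthesizer.add_constraints([my_constraint])

synthesizer.fit(data)
synthetic_data = synthesizer.sample()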

Visualize the Hierarchy

A self-referential hierarchy can be visualized as a tree-like dependency structure. For example, each row of the data (an employee) can be visualized as a node in an overall tree, pointing to the manager. The topmost node represents the CEO, followed by managers, employees, etc. In this way, it's possible to see the overall branching factor and depth of the tree too.
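
As an illustration of the row-to-node idea, the sketch below turns a small hierarchy into Graphviz DOT text, with one node per row and one edge from each row to the row it references. It is plain Python and is not how SDV builds its visualization. The helper name to_dot and the sample IDs are hypothetical, and the input is assumed to be a mapping from each primary key to its foreign key value (None for the topmost node).

```python
def to_dot(manager_of):
    """Build Graphviz DOT text: one node per row, one edge pointing to its parent."""
    lines = ['digraph hierarchy {']
    for node, parent in manager_of.items():
        lines.append(f'  "{node}";')
        if parent is not None:
            # The edge points from the employee to the manager.
            lines.append(f'  "{node}" -> "{parent}";')
    lines.append('}')
    return '\n'.join(lines)


# Hypothetical reporting chain: E1 reports to M1, who reports to the CEO.
employees = {'CEO': None, 'M1': 'CEO', 'E1': 'M1'}
print(to_dot(employees))
```

The printed text can be rendered with any Graphviz tool. The CEO is the only node without an outgoing edge, which is what makes it the top of the tree.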

Create a graphic that corresponds to your data using the visualize function.

Parameters:

  • (required) data: A pd.DataFrame object containing the data you want to visualize. The data should match the constraint

  • show_primary_keys: Toggle whether the primary key IDs should be displayed in the visualization

    • (default) True: Show the primary key ID for each row

    • False: Do not show the primary key. The visualization will include blank circles representing the nodes.

  • max_trees: If there are multiple, separate trees in your hierarchy, use this parameter to control the maximum number of individual trees you'd like to visualize

    • (default) None: Visualize all the data

    • <integer>: Only visualize the given number of trees. The remaining rows will not be visualized.

  • max_depth: If the trees have a very high depth, use this parameter to control the maximum depth you'd like to visualize

    • (default) None: Visualize all the data regardless of the depth

    • <integer>: Only visualize up to the given depth. Any nodes deeper than this level will not be visualized.

  • output_filepath: If provided, save the image at the given location in the given format.

    • (default) None: Do not save the visualization

    • <filepath>: A string with the filepath. This must end with the file extension of the format that you want to save as. Popular examples are png, jpg or pdf.

Output: A graphviz.graphs.Digraph object containing the visualization

from sdv.cag import SelfReferentialHierarchy

my_constraint = SelfReferentialHierarchy(
    table_name='Employees',
    primary_key='Employee ID',
    foreign_key='Manager ID')

# my_dataframe is the pd.DataFrame that contains the 'Employees' table
graph = my_constraint.visualize(
  my_dataframe,
  show_primary_keys=True,
  max_trees=2,
  max_depth=3,
  output_filepath='visualizations/my_graph.png'
)

In this example, the "Manager ID" column references the primary key, "Employee ID", in the same table. This self-reference is not allowed to have cycles, meaning that the manager/employee relationships form a strict reporting chain all the way up to the CEO. (If person A manages person B, then B cannot manage person A.)

❖ SDV Enterprise Bundle. This feature is available as part of the CAG Bundle, an optional add-on to SDV Enterprise. For more information, please visit the CAG Bundle page.

For more information about using predefined constraints, please see the Constraint-Augmented Generation tutorial.