SelfReferentialHierarchy

SDV Enterprise Bundle. This feature is available as part of the CAG Bundle, an optional add-on to SDV Enterprise. For more information, please visit the CAG Bundle page.

Use the SelfReferentialHierarchy constraint when a column in your table references the primary key column of the same table (a self-reference) and the self-references are not allowed to contain any cycles.

In this example, the "Manager ID" column references the primary key, "Employee ID", in the same table. This self-reference is not allowed to have cycles, meaning that the manager/employee relationships form a strict reporting chain all the way up to the CEO. (If person A manages person B, then B cannot manage person A.)
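
To make the "no cycles" requirement concrete, here is a minimal sketch of a cycle check. It is not part of the SDV API: the find_cycle helper and the plain dictionary input (employee ID mapped to manager ID) are assumptions made for illustration, and roots are assumed to have a null (None) manager. A root that lists itself as its own manager would be reported here as a one-element cycle.

```python
def find_cycle(parent_of):
    """Return the IDs that form a cycle, or None if there is no cycle.

    parent_of is a hypothetical mapping of each ID to its parent ID
    (None for a root). It stands in for the "Employee ID" and
    "Manager ID" columns of the example table.
    """
    for start in parent_of:
        seen = []
        current = start
        # Walk up the reporting chain until it ends or repeats.
        while current is not None:
            if current in seen:
                return seen[seen.index(current):]
            seen.append(current)
            current = parent_of.get(current)
    return None


# A strict reporting chain: C reports to B, B reports to the CEO.
valid = {'CEO': None, 'B': 'CEO', 'C': 'B'}

# A manages B and B manages A, which the constraint does not allow.
invalid = {'A': 'B', 'B': 'A'}

print(find_cycle(valid))
print(find_cycle(invalid))
```

Running a check like this on your real data before adding the constraint can help you spot rows that break the hierarchy.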

Constraint API

Create a SelfReferentialHierarchy constraint.

Parameters:

  • (required) table_name: A string with the name of the table that contains the self-reference

  • (required) base_column_name: A string with the name of the column that is acting as the base for the reference. This may be the primary key of your table. In our example, the base is the "Employee ID".

    • In older SDV Enterprise versions (0.33.0 and before), this parameter was called primary_key.

  • (required) parent_column_name: A string with the name of the column that is the parent of the base column. In our example, the parent column is "Manager ID".

    • In older SDV Enterprise versions (0.33.0 and before), this parameter was called foreign_key.

  • grandparent_column_name: A string with the name of the column that is the grandparent of the base column (aka the parent's parent). Our example doesn't have a grandparent column, but if it did, it would represent the skip-level manager (aka the manager's manager).

  • root_column_name: A string with the name of the column that contains the ultimate root of the tree. Our example doesn't have a root column, but if it did, it would represent the person at the top of the hierarchy (aka the CEO or organizational lead).

  • scaling_method: A string with the name of the method to use when scaling up the synthetic data

    • (default) 'branch': Keep the original depth of the hierarchy but add more branches to it. In our example above, this would add more reports for a given manager.

    • 'depth': Add to the depth of the self-references. In our example above, this would create new levels of managers, increasing the length of the reporting chain to the CEO.

    • 'multiply': Keep the original branching factor and depth to the hierarchy, but create more trees. In our example above, this would create additional companies with new CEOs and reporting chains.

from sdv.cag import SelfReferentialHierarchy

my_constraint = SelfReferentialHierarchy(
    table_name='Employees',
    base_column_name='Employee ID',
    parent_column_name='Manager ID'
)
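
The optional grandparent and root columns can be easier to picture with data. The sketch below derives both from the base and parent columns using plain Python. The add_ancestors helper and the column names "Skip Level Manager ID" and "CEO ID" are hypothetical, and the code assumes an acyclic hierarchy whose roots have a null manager. The text does not say what value the constraint expects in the root's own row, so listing the root as its own root is also an assumption.

```python
def add_ancestors(rows, base='Employee ID', parent='Manager ID'):
    """Return copies of the rows with grandparent and root values added.

    Hypothetical helper for illustration; it is not part of SDV.
    """
    parent_of = {row[base]: row[parent] for row in rows}
    result = []
    for row in rows:
        manager = row[parent]
        # The grandparent is the parent's parent (the skip-level manager).
        grandparent = parent_of.get(manager) if manager is not None else None
        # Follow the chain upward until there is no further parent.
        root = row[base]
        while parent_of.get(root) is not None:
            root = parent_of[root]
        result.append({
            **row,
            'Skip Level Manager ID': grandparent,  # hypothetical column name
            'CEO ID': root,                        # hypothetical column name
        })
    return result


rows = [
    {'Employee ID': 1, 'Manager ID': None},  # the CEO
    {'Employee ID': 2, 'Manager ID': 1},
    {'Employee ID': 3, 'Manager ID': 2},
]
out = add_ancestors(rows)
for row in out:
    print(row)
```

If your table already has columns like these, you would pass their names as grandparent_column_name and root_column_name.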

Make sure that the table and all the columns you provide are in your Metadata. Note that you cannot currently supply a self-reference relationship in the metadata, so the relationships section of your Metadata can be blank.

Usage

Apply the constraint to any SDV synthesizer. Then fit and sample as usual.

from sdv.multi_table import HSASynthesizer

synthesizer = HSASynthesizer(metadata)
synthesizer.add_constraints([my_constraint])

synthesizer.fit(data)
synthetic_data = synthesizer.sample()

For more information about using predefined constraints, please see the Constraint-Augmented Generation tutorial.

Visualize the Hierarchy

A self-referential hierarchy can be visualized as a tree-like dependency structure. For example, each row of the data (an employee) can be visualized as a node in an overall tree, pointing to the manager. The root node (topmost node) represents the CEO, followed by managers, employees, etc. In this way, it's possible to see the overall branching factor and depth of the tree too.
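
The branching factor and depth can also be computed directly, which may help when comparing real and synthetic data. The sketch below is an illustration only: hierarchy_stats is a hypothetical helper, not an SDV function, and it assumes an acyclic hierarchy given as a dictionary of ID to parent ID with null (None) parents for roots. Loosely speaking, the 'branch', 'depth' and 'multiply' scaling methods described earlier correspond to growing the branching, the depth and the number of trees, respectively.

```python
from collections import Counter


def hierarchy_stats(parent_of):
    """Summarize a hierarchy given as {id: parent_id or None}."""
    # Count how many direct reports each parent has.
    children = Counter(p for p in parent_of.values() if p is not None)

    def depth(node):
        # Number of steps from this node up to its root.
        steps = 0
        while parent_of.get(node) is not None:
            node = parent_of[node]
            steps += 1
        return steps

    return {
        'num_trees': sum(1 for p in parent_of.values() if p is None),
        'max_depth': max((depth(n) for n in parent_of), default=0),
        'max_branching': max(children.values(), default=0),
    }


employees = {'CEO': None, 'A': 'CEO', 'B': 'CEO', 'C': 'A'}
stats = hierarchy_stats(employees)
print(stats)
```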

Create a graphic that corresponds to your data using the visualize function.

Parameters:

  • (required) data: A pd.DataFrame object containing the data you want to visualize. The data should match the constraint.

  • label_nodes: Toggle whether to label each node of the tree with the value from your dataset

    • (default) True: Show the value for each node.

    • False: Do not label the nodes. The visualization will include blank circles representing the nodes.

  • max_trees: If there are multiple, separate trees in your hierarchy, use this parameter to control the maximum number of individual trees you'd like to visualize

    • (default) None: Visualize all the data

    • <integer>: Only visualize the given number of trees. The remaining rows will not be visualized.

  • max_depth: If the trees have a very high depth, use this parameter to control the maximum depth you'd like to visualize

    • (default) None: Visualize all the data regardless of the depth

    • <integer>: Only visualize up to the given depth. Any nodes deeper than this level will not be visualized.

  • output_filepath: If provided, save the image at the given location in the given format.

    • (default) None: Do not save the visualization

    • <filepath>: A string with the filepath. This must end with the filetype that you want to save as. Popular examples are png, jpg or pdf.

Output: A graphviz.graphs.Digraph object containing the visualization

from sdv.cag import SelfReferentialHierarchy

my_constraint = SelfReferentialHierarchy(
    table_name='Employees',
    base_column_name='Employee ID',
    parent_column_name='Manager ID'
)

graph = my_constraint.visualize(
    my_dataframe,
    label_nodes=True,
    max_trees=2,
    max_depth=3,
    output_filepath='visualizations/my_graph.png'
)

FAQs

How can the root node (e.g. the CEO) be encoded?

In our example, the root node — the CEO — appears in the table as an employee but they do not have any manager associated with them (the value for their manager is null). This is the recommended format.

However, the constraint also supports formats where the root node is encoded in different ways:

  • Self-assignment for the root node. For example, the CEO appears as an employee and their manager is listed as themselves.

  • No entry for the root node. For example, the CEO does not ever appear as an employee at all. (They only appear as a manager.)

The constraint should work with either of these formats. If you have multiple root nodes (e.g. multiple CEOs), please make sure they are all encoded consistently.
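
The three encodings can be compared side by side with a small sketch. The find_roots helper below is hypothetical and only illustrates how each format identifies the same root; it says nothing about how the constraint detects roots internally. The input is assumed to be a dictionary of employee ID to manager ID.

```python
def find_roots(parent_of):
    """Return the set of root IDs under any of the three encodings."""
    roots = set()
    for node, parent in parent_of.items():
        if parent is None:
            # Recommended format: the root has a null parent.
            roots.add(node)
        elif parent == node:
            # Self-assignment: the root is listed as its own parent.
            roots.add(node)
        elif parent not in parent_of:
            # No entry: the root only ever appears as a parent.
            roots.add(parent)
    return roots


null_parent = {'CEO': None, 'A': 'CEO', 'B': 'A'}
self_assigned = {'CEO': 'CEO', 'A': 'CEO', 'B': 'A'}
no_entry = {'A': 'CEO', 'B': 'A'}

for encoding in (null_parent, self_assigned, no_entry):
    print(find_roots(encoding))
```

A check like this can confirm that every root in your data follows the same encoding before you fit a synthesizer.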
