Synthetic Data Vault

Visualization



Copyright (c) 2023, DataCebo, Inc.

Use these functions to visualize your real and synthetic data in 1- or 2-dimensional space. The plots can help you see which patterns the synthesizer has learned and identify differences between the real and synthetic data.

get_column_plot

Use this function to visualize a real column against the corresponding synthetic column. You can plot any column of type boolean, categorical, datetime or numerical.

from sdv.evaluation.multi_table import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    table_name='guests',
    column_name='amenities_fee'
)
    
fig.show()

Parameters

  • (required) real_data: A pandas DataFrame object containing the table of your real data. To skip plotting the real data, input None.

  • (required) synthetic_data: A pandas DataFrame object containing the synthetic data. To skip plotting the synthetic data, input None.

  • (required) metadata: A Metadata object that describes the columns

  • (required) table_name: The name of the table

  • (required) column_name: The name of the column you want to plot

  • plot_type: The type of plot to create

    • (default) None: Determine an appropriate plot type based on your data type, as specified in the metadata.

    • 'bar': Plot the data as distinct bar graphs

    • 'distplot': Plot the data as a smooth, continuous curve

Output: A plotly Figure object that plots the distribution. The kind of plot depends on the column's sdtype.

Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.
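To make concrete what a bar-style comparison of a categorical column conveys, the sketch below computes the share of each category in a real and a synthetic column using only the Python standard library. It illustrates the idea and is not SDV's implementation; the helper name `category_frequencies` and the sample values are hypothetical.

```python
from collections import Counter


def category_frequencies(values):
    """Return each category's share of the column (hypothetical helper)."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Made-up example values for a categorical column.
real_column = ["BASIC", "BASIC", "DELUXE", "SUITE"]
synthetic_column = ["BASIC", "DELUXE", "DELUXE", "SUITE"]

real_freq = category_frequencies(real_column)
synthetic_freq = category_frequencies(synthetic_column)

# Print the two shares side by side, one row per category.
for category in sorted(set(real_freq) | set(synthetic_freq)):
    print(category, real_freq.get(category, 0.0), synthetic_freq.get(category, 0.0))
```

Categories whose shares differ noticeably between the two columns are the ones that would stand out in a bar plot.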

get_column_pair_plot

Use this utility to visualize the trends between a pair of columns for real and synthetic data. You can plot any 2 columns of type boolean, categorical, datetime or numerical. The columns do not have to be the same type.

from sdv.evaluation.multi_table import get_column_pair_plot

fig = get_column_pair_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    table_name='guests',
    column_names=['room_rate', 'room_type'],
)
    
fig.show()

Parameters

  • (required) real_data: A pandas DataFrame object containing the table of your real data. To skip plotting the real data, input None.

  • (required) synthetic_data: A pandas DataFrame object containing the synthetic data. To skip plotting the synthetic data, input None.

  • (required) metadata: A Metadata object that describes the columns

  • (required) table_name: The name of the table

  • (required) column_names: A list with the names of the 2 columns you want to plot

  • plot_type: The type of plot to create

    • (default) None: Determine an appropriate plot type based on your data type, as specified in the metadata.

    • 'box': Create a box plot showing the quartiles, broken down by different attributes

    • 'violin': Create a violin plot to show distributions, broken down by different attributes. This is an alternative to using 'box'

    • 'heatmap': Create a side-by-side heatmap showing the frequency of each pair of values

    • 'scatter': Create a scatter plot that plots each point on a 2D axis

  • sample_size: The number of data points to plot

    • (default) None: Plot all the data points

    • <integer>: Subsample rows from both the real and synthetic data before plotting. Use this if you have a lot of data points.

Output: A plotly Figure object that plots the 2D distribution. The kind of plot depends on the sdtypes of the two columns.

Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.
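The sample_size option and the 'heatmap' plot type can be pictured with a small standard-library sketch: subsample the rows, then count how often each pair of values occurs. This only illustrates the idea on made-up data. The helpers `subsample_rows` and `pair_counts` and the `has_rewards` column are hypothetical, and SDV's own sampling may work differently.

```python
import random
from collections import Counter


def subsample_rows(rows, sample_size, seed=0):
    """Return at most sample_size rows, chosen at random without replacement."""
    if sample_size is None or sample_size >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, sample_size)


def pair_counts(rows, column_names):
    """Count how often each pair of values appears in the two named columns."""
    first, second = column_names
    return Counter((row[first], row[second]) for row in rows)


# Made-up rows standing in for one table.
rows = [
    {"room_type": "BASIC", "has_rewards": False},
    {"room_type": "BASIC", "has_rewards": False},
    {"room_type": "DELUXE", "has_rewards": True},
    {"room_type": "SUITE", "has_rewards": True},
]

counts = pair_counts(rows, ["room_type", "has_rewards"])
sampled = subsample_rows(rows, sample_size=2)
print(counts[("BASIC", False)], len(sampled))
```

Running the same counting on the real and the synthetic rows gives two frequency tables, which is the kind of information a side-by-side heatmap displays.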

get_cardinality_plot

Use this utility to visualize the cardinality of a multi-table relationship. The cardinality refers to the number of child rows that each parent row has. This could be 0 or more.
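As a minimal sketch of this definition, the standard-library snippet below counts the child rows per parent row, including parents with no children. The helper `cardinality_per_parent` and the data are hypothetical and only illustrate the concept; they are not part of SDV.

```python
from collections import Counter


def cardinality_per_parent(parent_ids, child_foreign_keys):
    """Return the number of child rows for every parent ID, including zeros."""
    child_counts = Counter(child_foreign_keys)
    return {parent_id: child_counts.get(parent_id, 0) for parent_id in parent_ids}


# Made-up data: primary keys of 'users' and the 'user_id' column of 'sessions'.
user_ids = [1, 2, 3]
session_user_ids = [1, 1, 3]

cardinality = cardinality_per_parent(user_ids, session_user_ids)
print(cardinality)  # {1: 2, 2: 0, 3: 1}
```

Computing these counts separately for the real and the synthetic tables yields the two sets of cardinalities being compared.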

from sdv.evaluation.multi_table import get_cardinality_plot

fig = get_cardinality_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    child_table_name='sessions',
    parent_table_name='users',
    child_foreign_key='user_id',
    metadata=metadata)
    
fig.show()

Parameters

  • (required) real_data: A dictionary mapping each table name to a pandas DataFrame object with the real data

  • (required) synthetic_data: A dictionary mapping each table name to a pandas DataFrame object with the synthetic data

  • (required) child_table_name: A string describing the name of the child table in the relationship

  • (required) parent_table_name: A string describing the name of the parent table in the relationship

  • (required) child_foreign_key: A string describing the name of the foreign key column of the child table that references the parent table

  • (required) metadata: A Metadata object that describes the data

Output: A plotly Figure object that plots the cardinality of the real vs. the synthetic data for the provided relationship.

Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.