Sampling

Use these sampling methods to create synthetic data from your multi table model.

Sample Realistic Data

Create realistic synthetic data that follows the same format and mathematical properties as the real data.

sample

Use this function to create synthetic data that mimics the real data.

# Create synthetic data that is roughly 1.5x the size of the real data
synthetic_data = synthesizer.sample(
    scale=1.5
)

Parameters

  • scale: A float >0.0 that describes how much to scale the data by
    • (default) 1: Don't scale the data. The model will create synthetic data that is roughly the same size as the original data.
    • >1: Scale the data up by the specified factor. For example, 2.5 will create synthetic data that is roughly 2.5x the size of the original data.
    • <1: Shrink the data to the specified fraction. For example, 0.9 will create synthetic data that is roughly 90% of the size of the original data.

Returns A dictionary that maps each table name (string) to a pandas DataFrame object with synthetic data for that table. The synthetic data mimics the real data.

How large will the synthetic data be? During the fitting process, your SDV synthesizer learns the size of each data table. This is assumed to be a scale of 1. Scaling the entire dataset up or down means that the size of each table will change proportionally based on the original data size.

Note that some synthesizers may perform small, additional algorithmic calculations to determine the final size of each table. However, you can still expect the final synthetic data to approximately follow the scale of the real data (with some minor deviations).
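For instance, here is a minimal sketch that inspects the dictionary returned by sample above, printing each table's synthetic row count:

# Each key is a table name; each value is a pandas DataFrame
for table_name, table in synthetic_data.items():
    print(f'{table_name}: {len(table)} synthetic rows')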

reset_sampling

Use this function to reset any randomization in sampling. After calling this, your synthesizer will generate the same data as before. For example, in the code below, synthetic_data1 and synthetic_data2 are the same.

synthesizer.reset_sampling()
synthetic_data1 = synthesizer.sample(scale=1.5)

# Resetting again reproduces the exact same synthetic data
synthesizer.reset_sampling()
synthetic_data2 = synthesizer.sample(scale=1.5)

Parameters None

Returns None. Resets the synthesizer.
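If you'd like to verify this behavior yourself, here is a minimal sketch that uses pandas' DataFrame.equals to compare the two samples table by table:

# Every table should be identical across the two samples
for table_name in synthetic_data1:
    assert synthetic_data1[table_name].equals(synthetic_data2[table_name])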

Export Your Data

After sampling, export the data back into its original format.

See the Loading Data section for options.
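For example, here is a minimal sketch that writes each synthetic table to its own CSV file (the 'synthetic_data' output folder is just an illustrative choice):

import os

# Save every synthetic table as a separate CSV file
os.makedirs('synthetic_data', exist_ok=True)
for table_name, table in synthetic_data.items():
    table.to_csv(os.path.join('synthetic_data', f'{table_name}.csv'), index=False)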
