Empirical Differential Privacy

❖ SDV Enterprise Bundle. This feature is available as part of the Differential Privacy Bundle, an optional add-on to SDV Enterprise. For more information, please visit the Differential Privacy Bundle page.

Differential privacy is a mathematically-rigorous framework that you can use to create private synthetic data. Using our evaluation tool, you can empirically verify the differential privacy that a synthesizer algorithm offers for a dataset.

How does it work?

In the differential privacy setup, we are interested in measuring the impact that 1 row of training data has on the overall parameters that a synthesizer learns. Depending on the synthesizer's exact algorithm, the parameters may not be easily accessible or interpretable. Instead, we can create synthetic data using the synthesizer and assume that the patterns exhibited by the synthetic data reflect the parameters.

Our evaluation setup creates multiple synthesizers:

  • First, we train a synthesizer on all of the real training data.

  • Then, we remove a single row of training data and train a new synthesizer on the remaining data.

We can compare the synthetic data that the synthesizers produce. An algorithm with high differential privacy will produce similar synthetic data despite the removal of a row, no matter which row is removed.

In the SDV's setup, we compare the statistical differences between the different synthetic datasets using the quality score. (But in reality, we could use any statistical measure.) We repeat this process many times, leaving out a different row each time. The differential privacy score represents the worst-case scenario that we measure when leaving out a row of real data.

In other words, the differential privacy setup measures the effect that 1 row of training data has on the synthesizer's parameters, and we proxy those parameters by producing synthetic data instead.
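
To make the procedure concrete, here is a minimal sketch of the leave-one-out idea using the public single-table SDV API (GaussianCopulaSynthesizer, evaluate_quality). The helper function, the choice of comparison, and the worst-case aggregation are illustrative assumptions; the measure_differential_privacy tool described below handles these details for you and its exact scoring may differ.

# Illustrative sketch only. The real measure_differential_privacy tool chooses
# which rows to leave out, how to compare the synthetic datasets, and how to
# convert the comparison into a 0-1 score.
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.evaluation.single_table import evaluate_quality


def leave_one_out_similarity(data, metadata, rows_to_test, num_rows=100000):
    # Baseline: a synthesizer trained on all of the real data
    baseline = GaussianCopulaSynthesizer(metadata)
    baseline.fit(data)
    baseline_synthetic = baseline.sample(num_rows=num_rows)

    scores = []
    for row_index in rows_to_test:
        # Remove a single row and train a new synthesizer on the remaining data
        reduced = data.drop(index=row_index)
        synthesizer = GaussianCopulaSynthesizer(metadata)
        synthesizer.fit(reduced)
        synthetic = synthesizer.sample(num_rows=num_rows)

        # Compare the two synthetic datasets using the SDV quality score
        # (any statistical comparison would work here)
        report = evaluate_quality(baseline_synthetic, synthetic, metadata)
        scores.append(report.get_score())

    # Report the worst case observed across all of the left-out rows
    return min(scores)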

API & Usage

Use the measure_differential_privacy tool to empirically measure the differential privacy of a synthesizer algorithm on a dataset. You can supply any single-table SDV synthesizer for evaluation.

from sdv.evaluation.single_table import measure_differential_privacy

privacy_score = measure_differential_privacy(
  data=my_dataframe,
  metadata=my_metadata,
  synthesizer_name='GaussianCopulaSynthesizer',
  synthesizer_parameters={ 'default_distribution': 'norm' },
  num_rows_synthetic_data=1000000,
  num_rows_test=10,
  test_data_seed=42,
  verbose=True
)

Measuring differential privacy may take some time. This empirical measure trains multiple synthesizers, so depending on the synthesizer algorithm, the size of the dataset, and the number of rows you'd like to test, the overall measurement may require significant time and computing resources. We recommend starting with a smaller dataset and a smaller set of test rows.
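
For example, a cheaper first pass might subsample the real data and test only a few rows before scaling up. The sample sizes below are arbitrary illustrations, not SDV recommendations.

# Subsample the real data and test only a few rows for a quick first estimate
small_data = my_dataframe.sample(n=5000, random_state=0)

quick_score = measure_differential_privacy(
  data=small_data,
  metadata=my_metadata,
  synthesizer_name='GaussianCopulaSynthesizer',
  num_rows_synthetic_data=100000,
  num_rows_test=5,
  test_data_seed=42,
  verbose=True
)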

Parameters:

  • (required) data: A pandas.DataFrame containing the real data for training the synthesizer

  • (required) metadata: An SDV Metadata object that describes your data

  • (required) synthesizer_name: A string with the name of the synthesizer algorithm to use. You can choose from any of the single-table SDV synthesizers that you have access to.

  • synthesizer_parameters: A dictionary with the parameters to pass into the synthesizer. Use this to fine-tune the synthesizer algorithm.

    • (default) None: Use the default parameters for the given synthesizer

    • <dict>: A dictionary of parameters to use to fine-tune the synthesizer algorithm. The keys represent the parameter names, and the values are the parameter values.

  • num_rows_synthetic_data: The number of rows of synthetic data to produce before doing the differential privacy computations. We recommend using a large number of rows to get a stable representation of what the synthesizer has learned.

    • (default) 1000000: Create 1 million rows of synthetic data each time we train a synthesizer

  • num_rows_test: The number of rows of real data to test in a leave-one-out fashion. Each row represents an iteration of leaving the row out, training a synthesizer on the remaining data, and creating synthetic data. The evaluation tool optimizes the rows to leave out by purposefully choosing rows with outliers and other interesting patterns.

    • (default) 20: Choose 20 rows to leave out (1 at a time) and measure differential privacy.

  • test_data_seed: A seed to use to deterministically pick the rows to test

    • (default) None: Do not set a seed. Different rows may be left out each time you call this evaluation tool

  • verbose: Whether to show progress.

    • (default) True: Show a progress bar for each row that is tested

    • False: Do not show a progress bar

Returns: A privacy score representing the empirical differential privacy using the synthesizer algorithm for the given dataset. The score ranges from 0 to 1, describing the impact that 1 row of training data has on the synthesizer.

  • (best) 1.0: The synthesizer offers the best possible differential privacy protection. A single row of training data has no impact on what the synthesizer learns.

  • (worst) 0.0: The synthesizer offers the worst possible differential privacy protection. A single row of training data has a massive impact on what the synthesizer learns.
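
As an illustrative usage note, you might use the returned score to decide whether a synthesizer is acceptable before releasing synthetic data. The 0.95 threshold below is an arbitrary example, not an SDV recommendation.

# privacy_score comes from the measure_differential_privacy call above.
# The 0.95 threshold is an arbitrary illustration; pick one that matches
# your own privacy requirements.
if privacy_score >= 0.95:
    print(f'Low single-row impact (score={privacy_score:.3f})')
else:
    print(f'Noticeable single-row impact (score={privacy_score:.3f}); '
          'consider a more private synthesizer or different parameters')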
