Synthetic Data Vault

Copyright (c) 2023, DataCebo, Inc.

❖ DPGCFlexSynthesizer


Last updated 16 days ago

The DPGCFlexSynthesizer creates synthetic data that is differentially private. It is similar to the DPGCSynthesizer, but with modifications that give you more flexibility in preprocessing. You can update the transformers for most columns without violating differential privacy. For more information about the methodology, refer to the research paper.

This is an experimental synthesizer! Let us know if you're finding the modeling process and synthetic data creation useful.

from sdv.single_table import DPGCFlexSynthesizer

synthesizer = DPGCFlexSynthesizer(metadata, epsilon=2.5)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)

Creating a synthesizer

When creating your synthesizer, you are required to pass in:

  • A Metadata object as the first argument

  • An epsilon value as the second argument. This is a float (>0) that represents the privacy loss budget you're willing to accommodate. (See the parameter reference below for more information.)

All other parameters are optional. You can include them to customize the synthesizer.

synthesizer = DPGCFlexSynthesizer(
  metadata,
  epsilon=2.5, # we recommend values in the 0-10 range; 0-1 is the most conservative
  known_min_max_values={
    'age': {'min': 0, 'max': 120 },
    'salary': { 'min': 0 }
  },
  enforce_rounding=True,
  locales=['en_US'],
)

Parameter Reference

(required) epsilon: A float >0 that represents the privacy loss budget you are willing to accommodate.

How should I choose my privacy loss budget (epsilon)? The value of epsilon is a measure of how much risk you're willing to take on when it comes to privacy.

  • Values in the 0-1 range indicate that you are not willing to take on too much risk. As a result, the synthetic data will have strong privacy guarantees — potentially at the expense of data quality.

  • Values in the 2-10 range indicate that you're willing to accept some privacy risk in order to preserve more data quality.

Note: The smaller your epsilon value, the more data the synthesizer will require to fully enforce differential privacy. The exact amount of data required also depends on the number of columns in your dataset. For reference, a dataset with 14 columns will require at least 15K rows for an epsilon of 2.5.
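To get a feel for the scale of epsilon, it can help to recall the standard definition of epsilon-differential privacy, under which the probability of any output changes by at most a factor of e^epsilon when a single row is added or removed. The sketch below only tabulates that factor. It assumes the standard definition and says nothing about how the synthesizer spends its budget internally; the function name is hypothetical.

```python
import math

def max_probability_ratio(epsilon: float) -> float:
    """Worst-case factor e**epsilon under the standard epsilon-DP definition (an assumption here)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be a float > 0")
    return math.exp(epsilon)

# Compare a conservative budget with more permissive ones.
for eps in (0.5, 1.0, 2.5, 10.0):
    print(f"epsilon={eps:>4}: ratio bound = {max_probability_ratio(eps):,.2f}")
```

The bound grows exponentially, which is one way to see why values in the 0-1 range are considered the most conservative.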

known_min_max_values: A dictionary that provides the already-known min/max values for any of the numerical or datetime columns. Providing these values will help to conserve the privacy budget and ultimately yield higher quality synthetic data (for the same epsilon value).

The min/max values should represent prior knowledge of the data. In order to enforce differential privacy, it is critical that these min/max values are prior knowledge that is not based on any computations of the real data.

(default) None

There are no known min/max values. The synthesizer will use up some of your privacy loss budget to compute differentially-private min/max values.

<dictionary>

A dictionary with the known min/max values. The keys are the column names, and the value is another dictionary containing 'min' and 'max' keys. (You can provide one or both.)

For numerical columns, represent the min/max values as floats; for datetimes, represent them as pd.Timestamp objects.
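As a rough illustration of the dictionary shape described above, the sketch below checks a known_min_max_values dictionary before it is passed to the synthesizer. The helper check_known_min_max is hypothetical and not part of SDV; the column names are placeholders. It uses floats only, but the same comparison would apply to pd.Timestamp values for datetime columns.

```python
def check_known_min_max(known_values: dict) -> None:
    """Raise ValueError if the dictionary does not follow the documented shape."""
    for column, bounds in known_values.items():
        # Each column maps to a dictionary with 'min', 'max', or both.
        if not isinstance(bounds, dict) or not bounds:
            raise ValueError(f"{column!r}: expected a dict with 'min' and/or 'max'")
        unexpected = set(bounds) - {"min", "max"}
        if unexpected:
            raise ValueError(f"{column!r}: unexpected keys {sorted(unexpected)}")
        # When both bounds are given, they must be ordered.
        if "min" in bounds and "max" in bounds and bounds["min"] > bounds["max"]:
            raise ValueError(f"{column!r}: min is greater than max")

known_min_max_values = {
    "age": {"min": 0.0, "max": 120.0},  # both bounds known in advance
    "salary": {"min": 0.0},             # only the lower bound is known
}
check_known_min_max(known_min_max_values)
```

A check like this only validates the structure. It cannot confirm that the bounds are genuine prior knowledge, which remains your responsibility.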

enforce_rounding: Control whether the synthetic data should have the same number of decimal digits as the real data.

(default) True

The synthetic data will be rounded to the same number of decimal digits that were observed in the real data.

False

The synthetic data may contain more decimal digits than were observed in the real data.
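To make the rounding behavior concrete, here is a minimal sketch of the idea: find the largest number of decimal digits in the observed values and round the generated values to match. This is an illustration only, not the synthesizer's implementation; the helper names are hypothetical and the sketch assumes finite numeric values.

```python
from decimal import Decimal

def decimal_digits(value: float) -> int:
    """Count the digits after the decimal point in the value's shortest string form."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(0, -exponent)

def round_like(real_values, synthetic_values):
    """Round synthetic values to the most decimal digits observed in the real values."""
    digits = max(decimal_digits(v) for v in real_values)
    return [round(v, digits) for v in synthetic_values]

# The real column has at most 2 decimal digits, so the synthetic values get 2 as well.
print(round_like([1.5, 2.25, 10.0], [3.14159, 7.0]))
```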

locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.

(default) ['en_US']

Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)

<list>

Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.

For example ["en_US", "fr_CA"]
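The locale format described above can be checked with a short pattern before creating the synthesizer. This is a sketch under assumptions: the lowercase-language and uppercase-country convention is inferred from the examples on this page, the helper invalid_locales is hypothetical, and the underlying locale provider may accept codes that fall outside this pattern.

```python
import re

# Two lowercase letters (language), an underscore, two uppercase letters (country).
LOCALE_PATTERN = re.compile(r"^[a-z]{2}_[A-Z]{2}$")

def invalid_locales(locales):
    """Return the locale strings that do not match the language_COUNTRY format."""
    return [loc for loc in locales if not LOCALE_PATTERN.match(loc)]

print(invalid_locales(["en_US", "fr_CA"]))    # well-formed locales
print(invalid_locales(["en-US", "english"]))  # hyphen and bare language name
```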

get_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

Parameters None

Output A dictionary with the parameter names and the values

synthesizer.get_parameters()
{
    'epsilon': 2.5,
    'known_min_max_values': {
        'age': { 'min': 0, 'max': 120 },
        'salary': { 'min': 0 }
    },
    'enforce_rounding': True
}

The returned parameters are a copy. Changing them will not affect the synthesizer.

get_metadata

Use this function to access the metadata object that you provided to the synthesizer.

Parameters None

metadata = synthesizer.get_metadata()

The returned metadata is a copy. Changing it will not affect the synthesizer.

Learning from your data

To train a machine learning model on your real data, use the fit method.

fit

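Parameters

  • (required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from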

Output (None)

synthesizer.fit(data)

Saving your synthesizer

Save your trained synthesizer for future use.

save

Use this function to save your trained synthesizer as a Python pickle file.

Parameters

  • (required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl

Output (None) The file will be saved at the desired location

synthesizer.save(
    filepath='my_synthesizer.pkl'
)

DPGCFlexSynthesizer.load

Use this function to load a trained synthesizer from a Python pickle file.

Parameters

  • (required) filepath: A string describing the filepath of your saved synthesizer

Output Your synthesizer, as a DPGCFlexSynthesizer object

from sdv.single_table import DPGCFlexSynthesizer

synthesizer = DPGCFlexSynthesizer.load(
    filepath='my_synthesizer.pkl'
)

What's next?

Your synthetic data is differentially private. After fitting your synthesizer, you can sample any number of synthetic rows, and our algorithms ensure that all of them are differentially private.

synthetic_data = synthesizer.sample(num_rows=10)

Want to improve your synthesizer? Update transformations used for pre- and post-processing the data. You can update the transformers for any of the basic statistical columns — numerical, categorical, and datetime — while still maintaining differential privacy guarantees.

FAQs

What happens if columns don't contain numerical data?

This synthesizer models non-numerical columns, including columns with missing values.

Although the Gaussian Copula algorithm is designed for only numerical data, this synthesizer converts other data types using Reversible Data Transforms (RDTs).

What is the difference between DPGCFlex and the regular DPGC Synthesizer?

Both the DPGCFlex and the regular DPGC Synthesizer create synthetic data with differential privacy. The difference is that the DPGCFlex Synthesizer is more flexible in terms of the data pre-processing you can apply. The Flex Synthesizer will allow you to update any of the transformers for numerical, categorical and datetime data while still maintaining differential privacy. The Flex Synthesizer achieves this by adding differentially private noise immediately to your real data — that way you can apply preprocessing without violating differential privacy guarantees.

However, you may notice that the added flexibility comes at the price of data quality. The DPGCFlex Synthesizer is experimental. Please try it out and let us know if it's useful!
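The reasoning behind adding noise first is the post-processing property of differential privacy: any computation that only touches already-privatized values cannot weaken the guarantee. The sketch below illustrates this with a generic textbook Laplace mechanism applied to clipped values. It is not the algorithm the DPGCFlexSynthesizer uses (see the research paper for that); the function names, bounds, and per-value use of epsilon are all assumptions made for illustration.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def privatize(values, lower, upper, epsilon, seed=0):
    """Clip each value to [lower, upper], then add Laplace noise scaled to the range."""
    rng = random.Random(seed)
    scale = (upper - lower) / epsilon  # smaller epsilon means more noise
    noisy = []
    for value in values:
        clipped = min(max(value, lower), upper)
        noisy.append(clipped + laplace_noise(scale, rng))
    return noisy

ages = [23, 35, 47, 61]
noisy_ages = privatize(ages, lower=0, upper=120, epsilon=2.5)

# Post-processing: this transformation sees only the noisy values,
# so it does not consume any additional privacy budget.
rescaled = [age / 120 for age in noisy_ages]
```

The sketch also shows where the quality cost comes from: the noise is added to every value up front, so smaller epsilon values distort the data more before any modeling happens.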

For all locale options, see the Faker docs.

get_metadata Output: A Metadata object

fit Parameters: (required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from

Technical Details: This synthesizer uses the Gaussian Copula methodology, but with modifications to ensure differential privacy. For more information about the algorithm, please refer to this research paper.

After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.

For more details on updating transformers, see Preprocessing.

Example known_min_max_values dictionary:

{
  'age': { 'min': 0, 'max': 120 },
  'salary': { 'min': 0 }
}

❖ SDV Enterprise Bundle. This feature is available as part of the Differential Privacy Bundle, an optional add-on to SDV Enterprise. For more information, please visit the Differential Privacy Bundle page.