❖ DPGCSynthesizer

❖ SDV Enterprise Bundle. This feature is available as part of the Differential Privacy Bundle, an optional add-on to SDV Enterprise. For more information, please visit the Differential Privacy Bundle page.

The DPGCSynthesizer creates synthetic data that is differentially private. It is based on the classical statistical methods from the Gaussian Copula synthesizer, with added privacy guarantees. DPGC stands for Differential Privacy for Gaussian Copula. For more information about the algorithm, refer to the research paper.

from sdv.single_table import DPGCSynthesizer

synthesizer = DPGCSynthesizer(metadata, epsilon=2.5)
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=10)

Creating a synthesizer

When creating your synthesizer, you are required to pass in:

A Metadata object as the first argument
An epsilon value as the second argument. This is a float (>0) that represents the privacy loss budget you're willing to accommodate. (See the parameter reference below for more information.)

All other parameters are optional. You can include them to customize the synthesizer.

synthesizer = DPGCSynthesizer(
  metadata,
  epsilon=2.5, # we recommend values in the 0-10 range; 0-1 is the most conservative
  known_min_max_values={
    'age': { 'min': 0, 'max': 120 },
    'salary': { 'min': 0 }
  },
  enforce_rounding=True,
  locales=['en_US'],
)

Parameter Reference

(required) epsilon: A float >0 that represents the privacy loss budget you are willing to accommodate.

How should I choose my privacy loss budget (epsilon)? The value of epsilon is a measure of how much risk you're willing to take on when it comes to privacy.

Values in the 0-1 range indicate that you are not willing to take on much risk. As a result, the synthetic data will have strong privacy guarantees, potentially at the expense of data quality.
Values in the 2-10 range indicate that you're willing to accept some privacy risk in order to preserve more data quality.

Note: The smaller your epsilon value, the more data the synthesizer will require to fully enforce differential privacy. The exact amount of data required also depends on the number of columns in your dataset. For reference, a dataset with 14 columns will require at least 15K rows for an epsilon of 2.5.
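To build intuition for why a smaller epsilon needs more data, the sketch below applies the textbook Laplace mechanism to a single bounded mean. This is an illustration under stated assumptions and not the DPGC algorithm itself (refer to the research paper for that). The function names are hypothetical. In this toy setting, the noise scale grows as epsilon shrinks and shrinks as the number of rows grows.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noise_scale_for_mean(lower, upper, num_rows, epsilon):
    """Laplace scale for a private mean of values clipped to [lower, upper].

    Replacing one row moves the mean by at most (upper - lower) / num_rows,
    so the scale is that sensitivity divided by epsilon.
    """
    if epsilon <= 0:
        raise ValueError('epsilon must be > 0')
    sensitivity = (upper - lower) / num_rows
    return sensitivity / epsilon


rng = random.Random(0)  # seeded only to make the illustration repeatable
for epsilon in (0.5, 2.5, 10):
    scale = noise_scale_for_mean(lower=0, upper=120, num_rows=15_000, epsilon=epsilon)
    noisy_mean = 40 + laplace_noise(scale, rng)  # 40 is a made-up true mean
    print(f'epsilon={epsilon}: noise scale={scale:.5f}, noisy mean={noisy_mean:.4f}')
```

In this simplified model, halving epsilon doubles the noise unless you also double the number of rows. The real synthesizer's data requirements depend on its own algorithm and on the number of columns.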

known_min_max_values: A dictionary that provides the already-known min/max values for any of the numerical or datetime columns. Providing these values will help to conserve the privacy budget and ultimately yield higher quality synthetic data (for the same epsilon value).

The min/max values must represent prior knowledge of the data. To enforce differential privacy, it is critical that they are not based on any computation over the real data.

(default) None

There are no known min/max values. The synthesizer will use up some of your privacy loss budget to compute differentially-private min/max values.

<dictionary>

A dictionary with the known min/max values. The keys are the column names, and the value is another dictionary containing 'min' and 'max' keys. (You can provide one or both.)

For numerical columns, represent the min/max values as floats; for datetimes, represent them as pd.Timestamp objects.

{
  'age': { 'min': 0, 'max': 120 },
  'salary': { 'min': 0 }
}
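A misspelled column name or a 'min' that is larger than its 'max' is easy to miss, so it can help to sanity-check the dictionary before passing it to the synthesizer. The helper below is a hypothetical sketch and not part of SDV. Its name and the column_names argument are assumptions made for illustration.

```python
def check_known_min_max_values(known_values, column_names):
    """Return a list of problems found in a known_min_max_values-style dictionary.

    Hypothetical helper (not an SDV function). `column_names` is the set of
    numerical/datetime column names you expect in your data.
    """
    problems = []
    for column, bounds in known_values.items():
        if column not in column_names:
            problems.append(f'unknown column: {column!r}')
            continue
        extra_keys = set(bounds) - {'min', 'max'}
        if extra_keys:
            problems.append(f'{column!r}: unexpected keys {sorted(extra_keys)}')
        if 'min' not in bounds and 'max' not in bounds:
            problems.append(f"{column!r}: provide 'min', 'max', or both")
        if 'min' in bounds and 'max' in bounds and bounds['min'] > bounds['max']:
            problems.append(f"{column!r}: 'min' is greater than 'max'")
    return problems


known_min_max_values = {
    'age': {'min': 0, 'max': 120},
    'salary': {'min': 0},
}
problems = check_known_min_max_values(known_min_max_values, column_names={'age', 'salary'})
print(problems or 'no problems found')
```

The check only relies on the > comparison, so it should also work when the bounds are pd.Timestamp objects for datetime columns.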

enforce_rounding: Controls whether the synthetic data should have the same number of decimal digits as the real data.

(default) True

The synthetic data will be rounded to the same number of decimal digits that were observed in the real data.

False

The synthetic data may contain more decimal digits than were observed in the real data.
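To make the effect of enforce_rounding concrete, here is a plain-Python sketch of the idea: find the largest number of decimal digits observed in a real column, then round the generated values to that many digits. It illustrates the behavior described above and is not SDV's internal implementation. The function name and the sample values are made up for the example.

```python
from decimal import Decimal


def observed_decimal_digits(values):
    """Largest number of digits after the decimal point among finite numbers."""
    digits = 0
    for value in values:
        # str() gives the shortest repr (e.g. '3.25'); normalize() drops trailing zeros.
        exponent = Decimal(str(value)).normalize().as_tuple().exponent
        if exponent < 0:
            digits = max(digits, -exponent)
    return digits


real_values = [12.5, 3.25, 10.0]             # made-up "real" column
raw_synthetic_values = [7.123456, 11.987654]  # made-up unrounded model output

digits = observed_decimal_digits(real_values)
rounded = [round(value, digits) for value in raw_synthetic_values]
print(digits, rounded)
```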

locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.

(default) ['en_US']

Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)

<list>

Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.

For example: ['en_US', 'fr_CA']

For all options, see the Faker docs.

get_parameters

Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.

Parameters None

Output A dictionary with the parameter names and the values

synthesizer.get_parameters()

{
    'epsilon': 2.5,
    'known_min_max_values': {
        'age': { 'min': 0, 'max': 120 },
        'salary': { 'min': 0 }
    },
    'enforce_rounding': False
}

The returned parameters are a copy. Changing them will not affect the synthesizer.

get_metadata

Use this function to access the metadata object that you provided to the synthesizer.

Parameters None

Output A Metadata object

metadata = synthesizer.get_metadata()

The returned metadata is a copy. Changing it will not affect the synthesizer.

Learning from your data

To train a machine learning model on your real data, use the fit method.

fit

Parameters

(required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from

Output (None)

synthesizer.fit(data)

Technical Details: This synthesizer uses the Gaussian Copula methodology, but with modifications to ensure differential privacy. For more information about the algorithm, please refer to this research paper.

Saving your synthesizer

Save your trained synthesizer for future use.

save

Use this function to save your trained synthesizer as a Python pickle file.

Parameters

(required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl

Output (None) The file will be saved at the desired location

synthesizer.save(
    filepath='my_synthesizer.pkl'
)

load (utility function)

Use this utility function to load a trained synthesizer from a Python pickle file. After loading your synthesizer, you'll be able to sample synthetic data from it.

Parameters

(required) filepath: A string describing the filepath of your saved synthesizer

Output Your synthesizer object

from sdv.utils import load_synthesizer

synthesizer = load_synthesizer(
    filepath='my_synthesizer.pkl'
)

This utility function works for any SDV synthesizer.

What's next?

Get the SDVerified stamp of approval. Run the differential privacy verification on your synthesizer. Verify the results before you decide to sample any synthetic data or share your synthesizer.

After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.

synthetic_data = synthesizer.sample(num_rows=10)

FAQs

What happens if columns don't contain numerical data?

This synthesizer models non-numerical columns, including columns with missing values.

Although the Gaussian Copula algorithm is designed for only numerical data, this synthesizer converts other data types using Reversible Data Transforms (RDTs).
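As a rough intuition for what a reversible transform does, the sketch below turns a categorical column into numbers and back. It assigns each category a slice of the [0, 1] range proportional to its frequency, so any number in that range can be mapped back to a category. This is a simplified illustration with hypothetical function names. It is not the RDT library's API and may differ from the transformers the synthesizer actually uses.

```python
from collections import Counter


def fit_intervals(values):
    """Assign each category a slice of [0, 1] proportional to its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals = {}
    start = 0.0
    for category, count in counts.most_common():
        end = start + count / total
        intervals[category] = (start, end)
        start = end
    return intervals


def transform(values, intervals):
    """Forward step: replace each category with the midpoint of its slice."""
    return [(intervals[v][0] + intervals[v][1]) / 2 for v in values]


def reverse_transform(numbers, intervals):
    """Reverse step: map a number in [0, 1] back to the category whose slice contains it."""
    result = []
    for number in numbers:
        for category, (start, end) in intervals.items():
            if start <= number < end:
                result.append(category)
                break
        else:
            # Top edge (number == 1.0 or round-off): fall back to the last slice.
            result.append(category)
    return result


colors = ['red', 'blue', 'red', 'green']
intervals = fit_intervals(colors)
numbers = transform(colors, intervals)
restored = reverse_transform(numbers, intervals)
print(numbers, restored)
```

Because the reverse step accepts any number in the range, numeric values produced by a model (not only the exact midpoints) can be converted back into valid categories.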

