❖ DPGCSynthesizer
The DPGCSynthesizer creates synthetic data that is differentially private. It builds on the classical statistical methods of the Gaussian Copula synthesizer and adds privacy guarantees. DPGC stands for Differential Privacy for Gaussian Copula. For more information about the algorithm, refer to the .
When creating your synthesizer, you are required to pass in:
A metadata object as the first argument
An epsilon value as the second argument. This is a float (>0) that represents the privacy loss budget you're willing to accommodate. (See the parameter reference below for more information.)
All other parameters are optional. You can include them to customize the synthesizer.
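As a minimal sketch of the call described above, the helper below checks epsilon and then passes the metadata object and epsilon as the first and second arguments. The helper name create_dpgc_synthesizer is hypothetical, and the import path sdv.single_table is an assumption based on how other SDV single-table synthesizers are imported; adjust it to match your installation.

```python
def create_dpgc_synthesizer(metadata, epsilon, **optional_params):
    """Create a DPGCSynthesizer from a metadata object and a privacy budget.

    Hypothetical helper: it only validates epsilon and forwards the arguments.
    """
    # epsilon must be a number strictly greater than 0 (bool is rejected on purpose)
    if isinstance(epsilon, bool) or not isinstance(epsilon, (int, float)):
        raise ValueError("epsilon must be a float greater than 0")
    if epsilon <= 0:
        raise ValueError("epsilon must be a float greater than 0")

    # Assumption: the class lives in sdv.single_table, like other
    # single-table synthesizers. The import is done here so that the
    # validation above works even where the package is not installed.
    from sdv.single_table import DPGCSynthesizer

    # Metadata is the first argument, epsilon the second; everything else
    # (known_min_max_values, enforce_rounding, locales) is optional.
    return DPGCSynthesizer(metadata, float(epsilon), **optional_params)


# Example call, assuming `metadata` already describes your table:
# synthesizer = create_dpgc_synthesizer(metadata, epsilon=2.5, enforce_rounding=False)
```

Validating epsilon before the call is a design choice of this sketch, so that a zero or negative budget is caught early with a clear message.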
(required) epsilon: A float >0 that represents the privacy loss budget you are willing to accommodate.
known_min_max_values: A dictionary that provides the already-known min/max values for any of the numerical or datetime columns. Providing these values helps conserve the privacy budget and ultimately yields higher quality synthetic data (for the same epsilon value).
The min/max values should represent prior knowledge of the data. In order to enforce differential privacy, it is critical that these min/max values are prior knowledge that is not based on any computations of the real data.
(default) None
There are no known min/max values. The synthesizer will use up some of your privacy loss budget to compute differentially-private min/max values.
<dictionary>
A dictionary with the known min/max values. The keys are the column names, and each value is another dictionary containing 'min' and 'max' keys. (You can provide one or both.)
For numerical columns, represent the min/max values as floats; for datetimes, represent them as pd.Timestamp objects.
enforce_rounding: Control whether the synthetic data should have the same number of decimal digits as the real data.
(default) True
The synthetic data will be rounded to the same number of decimal digits that were observed in the real data.
False
The synthetic data may contain more decimal digits than were observed in the real data.
locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.
(default) ['en_US']
Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)
<list>
Create data from the list of locales. Each locale string consists of a 2-character code for the language and a 2-character code for the country, separated by an underscore.
For example: ["en_US", "fr_CA"]
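The shapes described for known_min_max_values and locales can be sketched as plain Python values. The column names (age, salary) and the two checker functions are hypothetical and exist only for illustration. The locale pattern assumes a lowercase language code and an uppercase country code, as in the examples above.

```python
import re

# Assumption: lowercase 2-letter language, underscore, uppercase 2-letter country
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def check_locales(locales):
    """Return the locale strings that do not look like 'en_US'."""
    return [loc for loc in locales if not LOCALE_PATTERN.fullmatch(loc)]


def check_known_min_max_values(known_values):
    """Return a list of problems found in a known_min_max_values dictionary."""
    problems = []
    for column, bounds in known_values.items():
        if not isinstance(bounds, dict) or not bounds:
            problems.append(f"{column}: expected a dictionary with 'min' and/or 'max'")
            continue
        extra = set(bounds) - {"min", "max"}
        if extra:
            problems.append(f"{column}: unexpected keys {sorted(extra)}")
        if "min" in bounds and "max" in bounds and bounds["min"] > bounds["max"]:
            problems.append(f"{column}: min is greater than max")
    return problems


# Numerical columns use floats. A datetime column would use pd.Timestamp
# objects for its bounds instead.
known_min_max_values = {
    "age": {"min": 0.0, "max": 120.0},  # both bounds are prior knowledge
    "salary": {"min": 0.0},             # only the lower bound is known
}
locales = ["en_US", "fr_CA"]
```

Remember that the bounds must come from prior knowledge (for example, a form that only accepts ages from 0 to 120) and not from computing the min or max of the real data.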
Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.
Parameters None
Output A dictionary with the parameter names and the values
Use this function to access the metadata object that you have included for the synthesizer
Parameters None
To learn a machine learning model based on your real data, use the fit method.
Parameters
Output (None)
Save your trained synthesizer for future use.
Use this function to save your trained synthesizer as a Python pickle file.
Parameters
(required) filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl
Output (None) The file will be saved at the desired location
Use this function to load a trained synthesizer from a Python pickle file
Parameters
(required) filepath: A string describing the filepath of your saved synthesizer
Output Your synthesizer, as a DPGCSynthesizer object
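Because the synthesizer is stored as a Python pickle file, the save and load steps amount to a pickle round trip. The sketch below shows that round trip with the standard pickle module and a stand-in dictionary in place of a trained synthesizer. The helper names save_object and load_object are hypothetical; with a real synthesizer you would use its own save and load functions.

```python
import os
import pickle
import tempfile


def save_object(obj, filepath):
    """Write obj to filepath as a pickle file (the path should end in .pkl)."""
    if not filepath.endswith(".pkl"):
        raise ValueError("filepath should end in .pkl")
    with open(filepath, "wb") as f:
        pickle.dump(obj, f)


def load_object(filepath):
    """Read an object back from a pickle file."""
    with open(filepath, "rb") as f:
        return pickle.load(f)


# Stand-in for a trained synthesizer, used only so the sketch runs on its own
stand_in = {"epsilon": 2.5, "enforce_rounding": True}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_synthesizer.pkl")
    save_object(stand_in, path)
    restored = load_object(path)
```

Pickle files can execute code when loaded, so only load synthesizer files that come from a source you trust.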
You can sample any number of synthetic data rows after fitting your synthesizer. Our algorithms ensure that all of the synthetic data is differentially private.
For all options, see the .
Output A pandas DataFrame object
(required) data: A pandas DataFrame object containing the real data that the machine learning model will learn from
Technical Details: This synthesizer uses the Gaussian Copula methodology, but with modifications to ensure differential privacy. For more information about the algorithm, please refer to .
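To give a feel for what the epsilon budget controls, the sketch below uses the textbook Laplace mechanism on a simple counting query. This is a generic illustration of differential privacy and not a description of how DPGC works internally; the function names are hypothetical. The one fact it relies on is that Laplace noise is drawn with scale sensitivity / epsilon, so a smaller epsilon means more noise and stronger privacy.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale of the Laplace mechanism: sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be greater than 0")
    return sensitivity / epsilon


def laplace_noise(scale, rng=random):
    """Draw Laplace(0, scale) noise.

    The difference of two independent Exp(1) draws is a standard
    Laplace variable, which is then multiplied by the scale.
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count, epsilon, rng=random):
    """Release a row count with epsilon-differential privacy.

    Adding or removing one row changes a count by at most 1,
    so the sensitivity of the query is 1.
    """
    return true_count + laplace_noise(laplace_scale(1.0, epsilon), rng)


# A tighter budget (smaller epsilon) gives a larger noise scale
tight_budget_scale = laplace_scale(1.0, 0.5)
loose_budget_scale = laplace_scale(1.0, 4.0)
noisy = private_count(100, epsilon=1.0, rng=random.Random(0))
```

The same trade-off applies when choosing epsilon for the synthesizer: lower values protect privacy more strongly at the cost of data quality.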
After training your synthesizer, you can now sample synthetic data. See the section for more details.
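The fit-then-sample flow can be sketched as a small helper. The helper name fit_and_sample is hypothetical, and the sample(num_rows=...) call is an assumption based on other SDV single-table synthesizers. StandInSynthesizer is a placeholder so the sketch runs on its own: it returns a list of dictionaries, whereas a real synthesizer works with tabular data.

```python
def fit_and_sample(synthesizer, real_data, num_rows):
    """Train a synthesizer on real data, then draw synthetic rows."""
    synthesizer.fit(real_data)  # fit returns None; the synthesizer is trained in place
    # Assumption: sampling takes the desired number of rows as num_rows
    return synthesizer.sample(num_rows=num_rows)


class StandInSynthesizer:
    """Placeholder exposing fit and sample, used only to make the sketch runnable."""

    def __init__(self):
        self._columns = None

    def fit(self, data):
        # Remember the column names seen during training
        self._columns = list(data[0].keys())

    def sample(self, num_rows):
        # Produce empty rows with the learned column names
        return [{column: None for column in self._columns} for _ in range(num_rows)]


synthetic_rows = fit_and_sample(StandInSynthesizer(), [{"age": 30.0, "salary": 50000.0}], 3)
```

With a real DPGCSynthesizer, pass the object you created and your real data in place of the stand-in values.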