❖ DPGCFlexSynthesizer
❖ SDV Enterprise Bundle. This feature is available as part of the Differential Privacy Bundle, an optional add-on to SDV Enterprise. For more information, please visit the Differential Privacy Bundle page.
The DPGCFlexSynthesizer creates synthetic data that is differentially private. It is similar to the DPGCSynthesizer, but with modifications that give you more flexibility in preprocessing. You can update the transformers for most columns without violating differential privacy. For more information about the methodology, refer to the research paper.
This is an experimental synthesizer! Let us know if you're finding the modeling process and synthetic data creation useful.
from sdv.single_table import DPGCFlexSynthesizer
synthesizer = DPGCFlexSynthesizer(metadata, epsilon=2.5)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=10)
Creating a synthesizer
When creating your synthesizer, you are required to pass in:
A Metadata object as the first argument
An epsilon value as the second argument. This is a float (>0) that represents the privacy loss budget you're willing to accommodate. (See the parameter reference below for more information.)
All other parameters are optional. You can include them to customize the synthesizer.
synthesizer = DPGCFlexSynthesizer(
    metadata,
    epsilon=2.5,  # we recommend values in the 0-10 range; 0-1 is the most conservative
    known_min_max_values={
        'age': {'min': 0, 'max': 120},
        'salary': {'min': 0}
    },
    enforce_rounding=True,
    locales=['en_US'],
)
Parameter Reference
(required) epsilon: A float >0 that represents the privacy loss budget you are willing to accommodate.
How should I choose my privacy loss budget (epsilon)? The value of epsilon is a measure of how much risk you're willing to take on when it comes to privacy.
Values in the 0-1 range indicate that you are not willing to take on too much risk. As a result, the synthetic data will have strong privacy guarantees — potentially at the expense of data quality.
Values in the 2-10 range indicate that you're willing to accept some privacy risk in order to preserve more data quality.
Note: The smaller your epsilon value, the more data the synthesizer will require to fully enforce differential privacy. The exact amount of data required also depends on the number of columns in your dataset. For reference, a dataset with 14 columns will require at least 15K rows for an epsilon of 2.5.
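To build intuition for why a smaller epsilon costs data quality, here is a minimal sketch of the classic Laplace mechanism applied to a simple count. This is a general differential privacy illustration, not the algorithm this synthesizer uses internally; the helper names laplace_noise_scale and noisy_count are hypothetical and are not part of SDV.

```python
import random


def laplace_noise_scale(sensitivity: float, epsilon: float) -> float:
    """Scale of the Laplace noise in the classic Laplace mechanism: sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be a float greater than 0")
    return sensitivity / epsilon


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a count with Laplace noise added (a count query has sensitivity 1)."""
    scale = laplace_noise_scale(sensitivity=1.0, epsilon=epsilon)
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # seeded only so the illustration is repeatable
for eps in (0.5, 2.5, 10.0):
    scale = laplace_noise_scale(1.0, eps)
    print(f"epsilon={eps}: noise scale={scale}, noisy count={noisy_count(1000, eps, rng):.2f}")
```

In this sketch the noise scale shrinks as epsilon grows, which mirrors the trade-off described above: a small budget means more noise and stronger privacy, while a larger budget preserves more of the original signal.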
known_min_max_values: A dictionary that provides the already-known min/max values for any of the numerical or datetime columns. Providing these values will help to conserve the privacy budget and ultimately yield higher quality synthetic data (for the same epsilon value).
The min/max values should represent prior knowledge of the data. In order to enforce differential privacy, it is critical that these min/max values are prior knowledge that is not based on any computations of the real data.
(default) None
There are no known min/max values. The synthesizer will use up some of your privacy loss budget to compute differentially-private min/max values.
<dictionary>
A dictionary with the known min/max values. The keys are the column names, and the value is another dictionary containing 'min' and 'max' keys. (You can provide one or both.)
For numerical columns, represent the min/max values as floats; for datetimes, represent them as pd.Timestamp objects.
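Before passing the dictionary to the synthesizer, it can help to sanity-check its shape. The sketch below assumes only the structure described above (column names as keys, each mapped to a dictionary with 'min' and/or 'max'). The helper check_known_min_max_values is hypothetical and is not part of SDV.

```python
def check_known_min_max_values(known_min_max_values):
    """Hypothetical helper: raise ValueError if the dictionary is malformed."""
    for column, bounds in known_min_max_values.items():
        if not isinstance(bounds, dict):
            raise ValueError(f"Bounds for {column!r} must be a dictionary")
        unexpected_keys = set(bounds) - {"min", "max"}
        if unexpected_keys or not bounds:
            raise ValueError(f"{column!r}: use only 'min' and/or 'max' as keys")
        if "min" in bounds and "max" in bounds and bounds["min"] > bounds["max"]:
            raise ValueError(f"{column!r}: 'min' must not exceed 'max'")
    return True


# Values come from prior knowledge, never from computations on the real data.
known_min_max_values = {
    "age": {"min": 0.0, "max": 120.0},  # both bounds known
    "salary": {"min": 0.0},             # only the lower bound known
}
check_known_min_max_values(known_min_max_values)
```

The same comparison works for datetime columns when both bounds are pd.Timestamp objects, since timestamps support ordering.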
enforce_rounding: Control whether the synthetic data should have the same number of decimal digits as the real data
(default) True
The synthetic data will be rounded to the same number of decimal digits that were observed in the real data
False
The synthetic data may contain more decimal digits than were observed in the real data
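As a rough illustration of what rounding to the observed precision means, the sketch below counts the decimal digits in a small list of real values and rounds raw synthetic values to match. It is not SDV's implementation; the helper observed_decimal_digits and the sample numbers are made up for this example, and the helper assumes there are no missing values.

```python
from decimal import Decimal


def observed_decimal_digits(values):
    """Largest number of digits after the decimal point among the given values."""
    digits = 0
    for value in values:
        # normalize() drops trailing zeros, so 7.0 counts as having 0 decimal digits
        exponent = Decimal(str(value)).normalize().as_tuple().exponent
        digits = max(digits, -exponent if exponent < 0 else 0)
    return digits


real_values = [12.5, 3.25, 7.0]
digits = observed_decimal_digits(real_values)

raw_synthetic_values = [4.98231, 11.50042]
rounded_values = [round(value, digits) for value in raw_synthetic_values]
print(digits, rounded_values)
```

With enforce_rounding=False, the raw values would be left as they are.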
locales: A list of locale strings. Any PII columns will correspond to the locales that you provide.
(default) ['en_US']
Generate PII values in English corresponding to US-based concepts (e.g. addresses, phone numbers, etc.)
<list>
Create data from the list of locales. Each locale string consists of a 2-character code for the language and 2-character code for the country, separated by an underscore.
For example ["en_US", "fr_CA"]
For all options, see the Faker docs.
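The locale format described above can be checked with a short regular expression. This sketch validates only the shape of the string (2-character language code, underscore, 2-character country code); it does not confirm that Faker actually supports a given locale. The helper is_well_formed_locale is hypothetical and is not part of SDV or Faker.

```python
import re

# Two lowercase letters, an underscore, two uppercase letters (e.g. "en_US")
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def is_well_formed_locale(locale: str) -> bool:
    """Return True if the string has the language_COUNTRY shape described above."""
    return LOCALE_PATTERN.fullmatch(locale) is not None


locales = ["en_US", "fr_CA"]
print(all(is_well_formed_locale(locale) for locale in locales))
```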
get_parameters
Use this function to access all the parameters your synthesizer uses: those you have provided as well as the default ones.
Parameters None
Output A dictionary with the parameter names and the values
The returned parameters are a copy. Changing them will not affect the synthesizer.
get_metadata
Use this function to access the metadata object that you provided when creating the synthesizer.
Parameters None
Output A Metadata object
The returned metadata is a copy. Changing it will not affect the synthesizer.
Learning from your data
To train a machine learning model on your real data, use the fit method.
fit
Parameters
(required)
data: A pandas DataFrame object containing the real data that the machine learning model will learn from
Output (None)
Technical Details: This synthesizer uses the Gaussian Copula methodology, but with modifications to ensure differential privacy. For more information about the algorithm, please refer to this research paper.
Saving your synthesizer
Save your trained synthesizer for future use.
save
Use this function to save your trained synthesizer as a Python pickle file.
Parameters
(required)
filepath: A string describing the filepath where you want to save your synthesizer. Make sure this ends in .pkl
Output (None) The file will be saved at the desired location
load (utility function)
Use this utility function to load a trained synthesizer from a Python pickle file. After loading your synthesizer, you'll be able to sample synthetic data from it.
Parameters
(required)
filepath: A string describing the filepath of your saved synthesizer
Output Your synthesizer object
This utility function works for any SDV synthesizer.
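Since the saved file is a Python pickle, the round trip behaves like any other pickle save and load. The sketch below uses only the standard library and a plain dictionary as a stand-in for a trained synthesizer; save_object and load_object are hypothetical helpers, not SDV functions. For a real synthesizer, use the save function and load utility described above.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, filepath):
    """Pickle obj to filepath, insisting on the .pkl extension recommended above."""
    path = Path(filepath)
    if path.suffix != ".pkl":
        raise ValueError("filepath should end in .pkl")
    with path.open("wb") as file:
        pickle.dump(obj, file)


def load_object(filepath):
    """Load a pickled object. Only unpickle files from sources you trust."""
    with Path(filepath).open("rb") as file:
        return pickle.load(file)


# Stand-in for a trained synthesizer (assumption for illustration only)
stand_in = {"epsilon": 2.5, "enforce_rounding": True, "locales": ["en_US"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "my_synthesizer.pkl"
    save_object(stand_in, target)
    restored = load_object(target)

print(restored == stand_in)
```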
What's next?
Get the SDVerified stamp of approval. Run the differential privacy verification on your synthesizer. Verify the results before you decide to sample any synthetic data or share your synthesizer.
After training your synthesizer, you can now sample synthetic data. See the Sampling section for more details.
Want to improve your synthesizer? Update transformations used for pre- and post-processing the data. You can update the transformers for any of the basic statistical columns — numerical, categorical, and datetime — while still maintaining differential privacy guarantees.
For more details, see Preprocessing.
FAQs
What happens if columns don't contain numerical data?
This synthesizer models non-numerical columns, including columns with missing values.
Although the Gaussian Copula algorithm is designed for only numerical data, this synthesizer converts other data types using Reversible Data Transforms (RDTs).
What is the difference between DPGCFlex and the regular DPGC Synthesizer?
Both the DPGCFlex and the regular DPGC Synthesizer create synthetic data with differential privacy. The difference is that the DPGCFlex Synthesizer is more flexible in terms of the data pre-processing you can apply. The Flex Synthesizer will allow you to update any of the transformers for numerical, categorical and datetime data while still maintaining differential privacy. The Flex Synthesizer achieves this by adding differentially private noise immediately to your real data — that way you can apply preprocessing without violating differential privacy guarantees.
However, you may notice that the added flexibility comes at the price of data quality. The DPGCFlex Synthesizer is experimental. Please try it out and let us know if it's useful!