BNLogLikelihood
Data Likelihood describes a set of metrics that calculate the likelihood of the synthetic data belonging to the real data. This metric uses the log-likelihood from a Bayesian Network to make this calculation.
Data Compatibility
Categorical: This metric is meant for discrete, categorical data
Boolean: This metric works on boolean data
This metric ignores any incompatible column types.
This metric does not accept missing values
Score
(highest) 0: According to the algorithm used, the synthetic data has the highest possible likelihood of belonging to the real data
(lowest) -∞: According to the algorithm used, the synthetic data has the lowest possible likelihood of belonging to the real data
There are multiple interpretations of the score. A high score can indicate high synthetic data quality as well as low privacy. A low score can indicate low synthetic data quality as well as high privacy.
How does it work?
This metric uses a Bayesian Network [1] from pomegranate [2] to learn the distribution of the real data. The model learns to produce a likelihood estimate for every row ranging from 0 to 1, where 0 means the row is likely not part of the data and 1 means that it is.
We apply the model to every row of the synthetic data. This metric takes the natural log of every likelihood, which transforms it from the [0, 1] range to the [-∞, 0] range. The final score is the average of these log-likelihoods.
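The log-and-average step can be sketched in plain Python. This only illustrates the arithmetic and is not the library's implementation; the function name `average_log_likelihood` and the sample likelihoods are hypothetical.

```python
import math


def average_log_likelihood(row_likelihoods):
    """Average the natural log of per-row likelihoods, each in [0, 1]."""
    if not row_likelihoods:
        raise ValueError("at least one row likelihood is required")
    log_scores = []
    for p in row_likelihoods:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"likelihood out of range: {p}")
        # math.log(0) raises ValueError, so map a zero likelihood to -inf.
        log_scores.append(math.log(p) if p > 0.0 else -math.inf)
    return sum(log_scores) / len(log_scores)


# Hypothetical likelihoods that a fitted model might assign to three rows.
example_likelihoods = [0.5, 0.25, 1.0]
example_score = average_log_likelihood(example_likelihoods)
```

Because the log of a value in [0, 1] is never positive, the average is at most 0, and a single row with likelihood 0 pulls the whole score down to -∞.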
Usage
You will need to install the pomegranate library in order to use this metric.
Access this metric from the single_table module and use the compute method.
Parameters
(required) real_data: A pandas.DataFrame containing the real data
(required) synthetic_data: A pandas.DataFrame containing the same columns of synthetic data
metadata: A metadata dictionary describing the columns (see Single Table Metadata)
structure: The BayesianNetwork structure to use when fitting to the real data. If not passed, it is learned from the data using the Chow-Liu algorithm [3].
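A minimal usage sketch follows. It assumes the metric class is exposed as `BNLogLikelihood` in `sdmetrics.single_table`, and that `metadata` may be left as `None` so that column types are inferred from the data. The toy tables and the wrapper name `bn_log_likelihood_score` are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables with one categorical and one boolean column,
# the two column types this metric supports. Neither has missing values.
real_table = pd.DataFrame({
    "plan": ["basic", "premium", "basic", "basic"],
    "active": [True, False, True, True],
})
synthetic_table = pd.DataFrame({
    "plan": ["basic", "basic", "premium", "basic"],
    "active": [True, True, False, False],
})


def bn_log_likelihood_score(real_data, synthetic_data, metadata=None):
    """Hypothetical wrapper around the metric's compute method."""
    # Imported here so this sketch loads even when sdmetrics or
    # pomegranate is not installed.
    from sdmetrics.single_table import BNLogLikelihood

    return BNLogLikelihood.compute(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
    )
```

With sdmetrics and pomegranate installed, calling `bn_log_likelihood_score(real_table, synthetic_table)` should return a single float no greater than 0.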
FAQs
This metric is in Beta. Be careful when using the metric and interpreting its score.
The score depends heavily on the algorithm used to model the data. If the overall distribution of the real data cannot be learned well, then the likelihood estimates of the synthetic data may not be valid.
There are multiple interpretations for this metric. (See the Score section above.) Which interpretation applies depends heavily on how much we trust the algorithm to model the real data.
References
[1] https://en.wikipedia.org/wiki/Bayesian_network