MissingValueSimilarity
This metric compares whether the synthetic data has the same proportion of missing values as the real data for a given column.
- All data: Any data is compatible with this metric as long as it contains missing values
(best) 1.0: The synthetic data perfectly captures the proportion of missing values
(worst) 0.0: The synthetic data has a completely different proportion of missing values than the real data
This test computes the proportion of missing values, p, in both the real and synthetic data, R and S. It takes the absolute difference between the two proportions and subtracts it from 1, which gives a similarity score in the range [0, 1], with 1 representing the highest similarity:

score = 1 - |R_p - S_p|

Note that the term |R_p - S_p| is equivalent to the Total Variation Distance [1] of the missing/non-missing values between the real and synthetic data.
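As a rough illustration of the computation described above, the sketch below measures the proportion of missing values in each column and returns 1 minus their absolute difference. The helper name missing_value_similarity and the sample values are hypothetical, and this is not the library's own code; it only assumes that pandas marks missing entries via isna().

```python
import numpy as np
import pandas as pd


def missing_value_similarity(real: pd.Series, synthetic: pd.Series) -> float:
    """Illustrative sketch (not the SDMetrics source): 1 - |R_p - S_p|."""
    # isna() yields booleans; their mean is the proportion of missing values.
    real_p = real.isna().mean()
    synthetic_p = synthetic.isna().mean()
    return float(1.0 - abs(real_p - synthetic_p))


# Hypothetical sample columns.
real = pd.Series([1.0, np.nan, 3.0, np.nan])      # 2 of 4 values missing
synthetic = pd.Series([1.0, 2.0, np.nan, 4.0])    # 1 of 4 values missing

score = missing_value_similarity(real, synthetic)
print(score)  # 1 - |0.5 - 0.25| = 0.75
```

Because a column has only two states here (missing or not missing), the absolute difference of the two proportions is the same quantity as the Total Variation Distance between the two missing/non-missing distributions.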
Access this metric from the single_column module and use the compute method.

from sdmetrics.single_column import MissingValueSimilarity

MissingValueSimilarity.compute(
    real_data=real_table['column_name'],
    synthetic_data=synthetic_table['column_name']
)
Parameters
- (required) real_data: A pandas.Series containing a single column with missing values
- (required) synthetic_data: A pandas.Series object with the synthetic version of the column
We use the same convention as pandas for determining when a value is missing [2]. Missing values in your data should be represented as NaN objects. If you are using any special notation to denote missing values, convert them to NaN values before using this metric.
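One possible way to do that conversion is sketched below, assuming a column in which '?' and -999 are used as missing-value placeholders. The placeholder values, the column contents, and the variable names are hypothetical; adapt them to whatever notation your data uses.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholders that stand for "missing" in the raw data.
placeholders = ['?', -999]

# Hypothetical raw column mixing real values and placeholders.
raw = pd.Series([12, -999, 7, '?', 3], dtype=object)

# mask() replaces every entry where the condition is True with NaN,
# so pandas (and therefore the metric) treats it as missing.
cleaned = raw.mask(raw.isin(placeholders), np.nan)

n_missing = int(cleaned.isna().sum())
print(n_missing)  # 2
```

The cleaned column can then be passed as real_data or synthetic_data. Apply the same conversion to both the real and the synthetic column so that the two proportions are comparable.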