Data Quality
The Quality Report checks for statistical similarity between the real and the synthetic data. Use this to discover which patterns the synthetic data has captured from the real data.
Usage
Run a quality report to receive a score and a corresponding report.
evaluate_quality
Use this function to evaluate the quality of the synthetic data against the real data.
Parameters:
(required) real_data: A pandas.DataFrame containing the real data
(required) synthetic_data: A pandas.DataFrame containing the synthetic data
(required) metadata: A SingleTableMetadata object with your metadata
verbose: A boolean describing whether or not to print the report progress and results. Defaults to True. Set this to False to run the report silently.
Returns: An SDMetrics QualityReport object generated with your real and synthetic data
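As a rough sketch of how a call might look, the helper below forwards the parameters listed above to evaluate_quality. The import path `sdv.evaluation.single_table` and the helper name `run_quality_report` are assumptions that this page does not state, so adjust them to your installed SDV version.

```python
def run_quality_report(real_data, synthetic_data, metadata, verbose=True):
    """Run the quality report and return the resulting report object.

    real_data, synthetic_data: pandas.DataFrame objects with matching columns.
    metadata: a SingleTableMetadata object describing those columns.
    """
    # Assumed import path (not given on this page). It is imported lazily so
    # the helper can be defined even where SDV is not installed.
    from sdv.evaluation.single_table import evaluate_quality

    return evaluate_quality(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
        verbose=verbose,  # set to False to run the report silently
    )
```

Calling `run_quality_report(real_data, synthetic_data, metadata, verbose=False)` would then return the report object without printing progress.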
Interpreting the Score
Your score will vary from 0% to 100%. This value tells you how similar the synthetic data is to the real data.
A 100% score means that the patterns are exactly the same. For example, if you compared the real data with itself (identity), the score would be 100%.
A 0% score means the patterns are as different as can be. This would require the synthetic data to deliberately contain patterns that are the opposite of those in the real data.
Any score in the middle can be interpreted along this scale. For example, a score of 80% means that the synthetic data is about 80% similar to the real data; that is, about 80% of the trends are similar.
The quality score is expected to vary, and you may never achieve exactly 100% quality. That's ok! The SDV synthesizers are designed to estimate patterns, meaning that they may smooth, extrapolate, or add noise to certain parts of the data. For more information, see the FAQs.
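To make the two ends of the scale concrete, here is a minimal sketch using one simple similarity measure for a single categorical column: 1 minus the total variation distance between the category frequencies. The measure and the function name `tv_complement` are chosen for illustration only; this page does not state which metrics the report computes.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """Return 1 - total variation distance between category frequencies."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )
    return 1.0 - distance


real = ["a", "a", "b", "c"]

# Comparing the real data with itself: identical patterns.
print(f"{tv_complement(real, real):.0%}")  # 100%

# Synthetic data that only uses categories never seen in the real data.
print(f"{tv_complement(real, ['x', 'y', 'y', 'z']):.0%}")  # 0%

# Partial overlap lands in between.
print(f"{tv_complement(real, ['a', 'b', 'b', 'c']):.0%}")  # 75%
```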
What's Included?
The different types of data quality are summarized in the table below.
Property | Description |
---|---|
Column Shapes | The statistical similarity between the real and synthetic data for single columns of data. This is often called the marginal distribution of each column. |
Column Pair Trends | The statistical similarity between the real and synthetic data for pairs of columns. This is often called the correlation or bivariate distributions of the columns. |
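
As an illustration of what a column-pair comparison can look like for two numerical columns, the sketch below compares the Pearson correlation found in the real data with the one found in the synthetic data. The scoring formula, the function names, and the sample columns are hypothetical and are not necessarily the exact computation used by the report.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_similarity(real_pair, synthetic_pair):
    """Score in [0, 1]: 1 means both datasets show the same correlation."""
    real_corr = pearson(*real_pair)
    synth_corr = pearson(*synthetic_pair)
    # Correlations lie in [-1, 1], so their gap is at most 2.
    return 1.0 - abs(real_corr - synth_corr) / 2.0


# Hypothetical columns: age and income rise together in the real data.
real_age, real_income = [20, 30, 40, 50], [20, 35, 45, 60]
# The synthetic data reverses the trend.
synth_age, synth_income = [20, 30, 40, 50], [60, 45, 35, 20]

same = correlation_similarity((real_age, real_income), (real_age, real_income))
flipped = correlation_similarity((real_age, real_income), (synth_age, synth_income))
print(round(same, 3), round(flipped, 3))
```

Here `same` is exactly 1 because the pair is compared with itself, while `flipped` is close to 0 because the trend is reversed.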
get_details
This function returns details about the report's properties. Use it to pinpoint the exact columns or tables that are causing issues.
Parameters:
(required) property_name: A string with the name of the property. One of: 'Column Shapes' or 'Column Pair Trends'.
Returns: A pandas.DataFrame object with the detailed scores
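One way to pinpoint problem columns is to sort the details table by score. The sketch below assumes the returned DataFrame has a score column named 'Score'; that column name and the helper name `lowest_scoring_rows` are assumptions, so inspect the columns of the table in your version before relying on them.

```python
def lowest_scoring_rows(report, property_name="Column Shapes", n=5,
                        score_column="Score"):
    """Return the n rows with the lowest scores for one property.

    report: a QualityReport, e.g. the object returned by evaluate_quality.
    score_column: assumed name of the score column in the details table;
        check details.columns if your version uses a different name.
    """
    details = report.get_details(property_name=property_name)
    # Ascending sort puts the weakest columns (or column pairs) first.
    return details.sort_values(score_column).head(n)
```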
save
Use this function to save the report object to a file.
The report does not save the full real and synthetic datasets, but it does save the metadata along with the score for each property, breakdown and metric. We recommend caution when deciding where to store the report and who to share it with.
Parameters:
(required) filepath: The name of the file where the object will be saved. This must end with .pkl
Returns: (None) Saves the report as a file
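Because the filepath must end with .pkl, a small guard can catch a wrong extension before anything is written. This is only a sketch: the helper name `save_report` is hypothetical, and `report` stands for any quality report object that exposes the save function described above.

```python
from pathlib import Path


def save_report(report, filepath):
    """Save a quality report, enforcing the required .pkl extension."""
    path = Path(filepath)
    if path.suffix != ".pkl":
        raise ValueError(f"filepath must end with .pkl, got {path.name!r}")
    report.save(filepath=str(path))
    return path
```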
QualityReport.load
Use this function to load in a previously-saved quality report.
Parameters:
(required) filepath: The name of the file where the report is stored
Returns: An SDMetrics QualityReport object
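Loading might look like the sketch below. The import path `sdmetrics.reports.single_table` and the helper name `load_report` are assumptions that this page does not state. Since the .pkl extension suggests a pickle file, it is prudent to load only files that come from a source you trust.

```python
def load_report(filepath):
    """Load a previously saved quality report from a .pkl file."""
    # Assumed import path for the SDMetrics class (not given on this page).
    # It is imported lazily so the helper can be defined without SDMetrics.
    from sdmetrics.reports.single_table import QualityReport

    return QualityReport.load(filepath)
```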
FAQs
See the SDMetrics QualityReport documentation for more details about the metrics and properties included in the report.