Evaluation

As the final step of your synthetic data project, you can evaluate the synthetic data against the real data and visualize the comparison.

Evaluation

Compare the real and synthetic data to determine whether the statistical and mathematical properties are similar.

evaluate_quality

Use this function to evaluate the quality of your synthetic data in terms of column shapes and correlations.
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 17/17
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 136/136
Overall Quality Score: 80.5%
Properties:
- Column Shapes: 82.0%
- Column Pair Trends: 79.0%
Parameters
  • (required) real_data: A pandas DataFrame object with the real data
  • (required) synthetic_data: A pandas DataFrame object with the synthetic data
  • (required) metadata: A SingleTableMetadata object that describes the columns
  • verbose: A boolean that indicates whether to print the progress of generating the quality report. Defaults to True.
Output: An SDMetrics QualityReport object generated with your real and synthetic data.
How is this score computed? This score is based on the shapes of individual columns as well as the correlations (trends) between every pair of columns. For more information, see the SDMetrics Quality Report.
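In the printed report above, the overall score matches the simple average of the two property scores: (82.0% + 79.0%) / 2 = 80.5%. The sketch below reproduces that arithmetic in plain Python. The helper name overall_quality_score is hypothetical and not part of SDV, and it assumes the properties are weighted equally, which is consistent with the numbers shown here.

```python
def overall_quality_score(property_scores):
    """Average the per-property scores (assumes equal weights)."""
    if not property_scores:
        raise ValueError("at least one property score is required")
    return sum(property_scores.values()) / len(property_scores)


# Property scores copied from the printed report, written as fractions.
printed_properties = {"Column Shapes": 0.82, "Column Pair Trends": 0.79}
overall = overall_quality_score(printed_properties)
print(f"Overall Quality Score: {overall:.1%}")
```

Because the score is an average, a weak property can be partly hidden by a strong one, so it is worth reading the per-property values as well as the overall number.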
Get more information. Drill down further by interacting with the report object.
>>> quality_report.get_score()
0.8911957928670733
>>> quality_report.get_properties()
Property            Score
Column Shapes       0.902702
Column Pair Trends  0.879690
>>> quality_report.get_details(property_name='Column Shapes')
Column         Metric        Score
amenities_fee  KSComplement  0.921127
checkin_date   KSComplement  0.926000
...
See the Quality Report API for more details.
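The KSComplement metric named in the details table is based on the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDFs of the real and synthetic column, subtracted from 1 so that higher is better. The following is a minimal pure-Python sketch of that idea for numerical values, not the SDMetrics implementation. The function name ks_complement and the sample values are illustrative assumptions.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the largest gap between the two empirical CDFs."""
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    # The gap between two step functions can only change at observed values.
    points = sorted(set(real_sorted) | set(synth_sorted))
    max_gap = 0.0
    for x in points:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


# Made-up fee values, used only to exercise the function.
real_fees = [10.0, 20.0, 30.0, 40.0]
synthetic_fees = [10.0, 20.0, 30.0, 50.0]
print(ks_complement(real_fees, synthetic_fees))  # 0.75
```

A score of 1.0 means the two empirical distributions are identical, and 0.0 means they do not overlap at all.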

run_diagnostic

Use this function to run basic diagnostic checks on your synthetic data. The checks show whether the synthetic rows are exact copies of the real data, whether the synthetic data covers the full range of values, and whether the synthetic data stays within the original ranges.
from sdv.evaluation.single_table import run_diagnostic
diagnostic_report = run_diagnostic(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Creating report: 100%|██████████| 4/4 [00:03<00:00, 1.10it/s]
DiagnosticResults:
SUCCESS:
✓ The synthetic data covers over 90% of the numerical ranges present in the real data
✓ The synthetic data covers over 90% of the categories present in the real data
✓ Over 90% of the synthetic rows are not copies of the real data
✓ The synthetic data follows over 90% of the min/max boundaries set by the real data
Parameters
  • (required) real_data: A pandas DataFrame object with the real data
  • (required) synthetic_data: A pandas DataFrame object with the synthetic data
  • (required) metadata: A SingleTableMetadata object that describes the columns
  • verbose: A boolean that indicates whether to print the progress of running the diagnostic. Defaults to True.
Output: An SDMetrics DiagnosticReport object generated with your real and synthetic data.
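The range coverage check reported above asks how much of a real column's numerical range the synthetic column reaches. One way to quantify this, sketched below in plain Python, is to measure the fraction of the real range left uncovered at each end and subtract both from 1. This is an illustration under that assumption, not the SDMetrics implementation. The function name range_coverage and the sample values are made up.

```python
def range_coverage(real_values, synthetic_values):
    """Fraction of the real [min, max] range that the synthetic values span."""
    real_min, real_max = min(real_values), max(real_values)
    real_range = real_max - real_min
    if real_range == 0:
        raise ValueError("coverage is undefined when the real column is constant")
    synth_min, synth_max = min(synthetic_values), max(synthetic_values)
    # Portion of the real range missed at the low end and at the high end.
    missed_low = max((synth_min - real_min) / real_range, 0)
    missed_high = max((real_max - synth_max) / real_range, 0)
    return max(1 - (missed_low + missed_high), 0)


# Made-up values: the synthetic data misses 20% of the real range at each end.
print(range_coverage([0, 10], [2, 8]))
```

Synthetic values that fall outside the real range do not raise this score above 1. Going beyond the real minimum or maximum is the concern of the separate boundaries check.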
The diagnostic report contains a summary as well as detailed breakdowns to uncover new insights. You can interact with the object to learn more.
>>> diagnostic_report.get_results()
{
    'SUCCESS': [
        'The synthetic data covers over 90% of the numerical ranges present in the real data',
        'The synthetic data covers over 90% of the categories present in the real data',
        ...
    'WARNING': ... ,
    'DANGER': ...
}
>>> diagnostic_report.get_properties()
{
    'Coverage': 0.959794788777474,
    'Synthesis': 0.948,
    'Boundaries': 0.9540833333333333
}
>>> diagnostic_report.get_details(property_name='Coverage')
Column         Metric         Diagnostic Score
amenities_fee  RangeCoverage  1.00000
checkin_date   RangeCoverage  1.00000
...
See the Diagnostic Report API for more details.
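Because the property scores are plain numbers, a pipeline can compare them against a threshold automatically. The sketch below uses the property values shown above and the 90% level that appears in the result messages. The helper name properties_below is hypothetical, and applying one threshold to every property is an assumption rather than SDV behavior.

```python
def properties_below(property_scores, threshold=0.9):
    """Return the properties whose score falls below the threshold.

    Assumes a dict of property name to score, as in the output shown above.
    """
    return {
        name: score
        for name, score in property_scores.items()
        if score < threshold
    }


# Values copied from the get_properties() output shown in this section.
diagnostic_properties = {
    "Coverage": 0.959794788777474,
    "Synthesis": 0.948,
    "Boundaries": 0.9540833333333333,
}
flagged = properties_below(diagnostic_properties)
print(flagged or "All diagnostic properties are at or above the threshold")
```

With a stricter threshold such as 0.95, only the Synthesis property (0.948) would be flagged for these values.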

Visualization

Visualize the shapes of your columns in 1D and 2D.

get_column_plot

Use this function to visualize a real column against the same synthetic column. You can plot any column of type: boolean, categorical, datetime or numerical.
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata
)
fig.show()
Parameters
  • (required) real_data: A pandas DataFrame object containing the table of your real data
  • (required) synthetic_data: A pandas DataFrame object containing the synthetic data
  • (required) column_name: The name of the column you want to plot
  • (required) metadata: A SingleTableMetadata object that describes the columns
Output: A plotly Figure object that plots the distribution. The type of plot depends on the column's sdtype.
Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.
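Outside a notebook, a plotly Figure can also be written to a standalone HTML file with its write_html method, which keeps the plot interactive in a browser. A small sketch of that approach follows. The helper name save_figure_html and the output location are hypothetical.

```python
from pathlib import Path


def save_figure_html(fig, output_dir, column_name):
    """Write a plotly Figure to <output_dir>/<column_name>.html and return the path."""
    output_path = Path(output_dir) / f"{column_name}.html"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    fig.write_html(str(output_path))
    return output_path


# Example call, assuming fig is the Figure returned by get_column_plot:
# save_figure_html(fig, "plots", "amenities_fee")
```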

get_column_pair_plot

Use this function to visualize the trends between a pair of columns for real and synthetic data. You can plot any 2 columns of type: boolean, categorical, datetime or numerical. The columns do not have to be the same type.
from sdv.evaluation.single_table import get_column_pair_plot
fig = get_column_pair_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_names=['room_rate', 'room_type'],
    metadata=metadata)
fig.show()
Parameters
  • (required) real_data: A pandas DataFrame object containing the table of your real data
  • (required) synthetic_data: A pandas DataFrame object containing the synthetic data
  • (required) column_names: A list with the names of the 2 columns you want to plot
  • (required) metadata: A SingleTableMetadata object that describes the columns
Output: A plotly Figure object that plots the 2D distribution. The type of plot depends on the sdtypes of the two columns.
Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.
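To review several pairs, you can enumerate every unordered pair of column names and plot each one in turn. This is the same counting seen in the quality report output earlier in this document, where 17 columns produce 136 column pairs. The helper name all_column_pairs is hypothetical. The column names are the ones used in the examples in this document.

```python
from itertools import combinations


def all_column_pairs(column_names):
    """Return every unordered pair of column names as a 2-item list."""
    return [list(pair) for pair in combinations(column_names, 2)]


columns = ["amenities_fee", "checkin_date", "room_rate", "room_type"]
pairs = all_column_pairs(columns)
print(len(pairs))  # 4 columns give 4 * 3 / 2 = 6 pairs

# Each pair can then be passed as column_names, for example:
# for pair in pairs:
#     fig = get_column_pair_plot(
#         real_data=real_data,
#         synthetic_data=synthetic_data,
#         column_names=pair,
#         metadata=metadata)
#     fig.show()
```

The number of pairs grows quadratically with the number of columns, so for wide tables it may be more practical to plot only the pairs you care about.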

Need more evaluation options?

The SDMetrics library includes many more metrics (some experimental) that you can apply based on your goals. All you need is your real data, synthetic data and metadata to get started.
Copyright (c) 2023, DataCebo, Inc.