Evaluation
As a final step in your synthetic data project, you can evaluate and visualize the synthetic data against the real data.
Compare the real and synthetic data to determine whether the statistical and mathematical properties are similar.
Use this function to evaluate the quality of your synthetic data in terms of column shapes, correlations and parent-child relationships.
from sdv.evaluation.multi_table import evaluate_quality
quality_report = evaluate_quality(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata)
Generating report ...
(1/4) Evaluating Column Shapes: : 100%|██████████| 13/13 [00:00<00:00, 338.47it/s]
(2/4) Evaluating Column Pair Trends: : 100%|██████████| 22/22 [00:00<00:00, 95.98it/s]
(3/4) Evaluating Cardinality: : 100%|██████████| 2/2 [00:00<00:00, 69.99it/s]
(4/4) Evaluating Intertable Trends: : 100%|██████████| 36/36 [00:00<00:00, 111.46it/s]
Overall Quality Score: 62.49%
Properties:
- Column Shapes: 79.23%
- Column Pair Trends: 42.5%
- Cardinality: 80.0%
- Intertable Trends: 48.24%
Parameters
- (required) real_data: A dictionary mapping each table name to a pandas DataFrame object with the real data
- (required) synthetic_data: A dictionary mapping each table name to a pandas DataFrame object with the synthetic data
- verbose: A boolean that indicates whether to print the progress of generating the report. Defaults to True.
How is this score computed? The overall score combines four properties: the shapes of individual columns, the correlations (trends) between pairs of columns within a table, the cardinality between tables, and the trends between columns of related tables. For more information, see the SDMetrics Quality Report.
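As a rough illustration of how the headline number relates to the property breakdown: in the sample report above, the overall score matches the plain average of the four property scores. The snippet below checks that arithmetic. The dictionary is typed in by hand from the printed summary and is not produced by SDV, and the unweighted mean is stated here as an assumption.

```python
# Property scores copied by hand from the printed report summary above.
property_scores = {
    'Column Shapes': 0.7923,
    'Column Pair Trends': 0.4250,
    'Cardinality': 0.8000,
    'Intertable Trends': 0.4824,
}

# Assumption: the overall score is the unweighted mean of the property scores.
overall_score = sum(property_scores.values()) / len(property_scores)

print(f'Overall Quality Score: {overall_score:.2%}')  # Overall Quality Score: 62.49%
```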
Get more information. Drill down further by interacting with the report object.
>>> quality_report.get_score()
0.783449101193
>>> quality_report.get_properties()
Property            Score
Column Shapes       0.7923
Column Pair Trends  0.4250
Cardinality         0.8800
Intertable Trends   0.4824
>>> quality_report.get_details(property_name='Column Shapes', table_name='guests')
Table   Column         Metric        Score
guests  amenities_fee  KSComplement  0.921127
guests  checkin_date   KSComplement  0.926000
...
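The KSComplement metric listed in the details table is based on the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDFs of the real and synthetic column, subtracted from 1 so that higher means more similar. The following is a minimal pure-Python sketch of that idea, not the SDMetrics implementation. The function name and the sample values are made up for illustration.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the two-sample KS statistic (illustrative sketch)."""
    real_sorted = sorted(real_values)
    synthetic_sorted = sorted(synthetic_values)

    # The largest CDF gap is always reached at one of the observed values.
    points = sorted(set(real_sorted) | set(synthetic_sorted))

    max_gap = 0.0
    for x in points:
        # Empirical CDF: fraction of values less than or equal to x.
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synthetic = bisect_right(synthetic_sorted, x) / len(synthetic_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synthetic))

    return 1.0 - max_gap


# Hypothetical column values, for illustration only.
real_fees = [10.0, 12.5, 15.0, 20.0]
synthetic_fees = [11.0, 12.5, 16.0, 30.0]
print(ks_complement(real_fees, synthetic_fees))
```

A score of 1.0 means the two empirical distributions coincide; a score of 0.0 means they do not overlap at all.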
This method is in Beta testing. We hope to update this function based on your needs. If you have feedback, please let us know on Slack or by filing a new GitHub issue.
Use this function to receive some diagnostic results about your synthetic data. Check to see if the synthetic rows are pure copies of the real data, if the synthetic data covers the full range of values and if the synthetic data adheres to the original ranges.
from sdv.evaluation.multi_table import run_diagnostic
diagnostic_report = run_diagnostic(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata)
Creating report: 100%|████████████████| 200/200 [01:21<00:03, 2.37it/s]
Diagnostic Results
SUCCESS
✓ Over 90% of the synthetic rows are not copies of the real data
✓ The synthetic data covers over 90% of the numerical ranges present in the
real data
WARNING
! The synthetic data is missing more than 10% of the categories present in
the real data
DANGER
x More than 50% the synthetic data does not follow the min/max boundaries
set by the real data
Parameters
- (required) synthetic_data: A dictionary mapping each table name to a pandas DataFrame object with the synthetic data
- (required) real_data: A dictionary mapping each table name to a pandas DataFrame object with the real data
- verbose: A boolean that indicates whether to print the progress of running the diagnostic. Defaults to True.
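To make the three diagnostic questions concrete (are synthetic rows copies, are all categories covered, do values stay within the real min/max), here is a simplified pure-Python sketch. It uses exact matching on small hand-written rows and is only an approximation of what the report measures. All function names and data below are hypothetical.

```python
def new_row_fraction(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are not exact copies of a real row."""
    real_set = set(real_rows)
    new_rows = [row for row in synthetic_rows if row not in real_set]
    return len(new_rows) / len(synthetic_rows)


def category_coverage(real_values, synthetic_values):
    """Fraction of real categories that also appear in the synthetic data."""
    real_categories = set(real_values)
    covered = real_categories & set(synthetic_values)
    return len(covered) / len(real_categories)


def boundary_adherence(real_values, synthetic_values):
    """Fraction of synthetic values inside the real [min, max] range."""
    low, high = min(real_values), max(real_values)
    inside = [value for value in synthetic_values if low <= value <= high]
    return len(inside) / len(synthetic_values)


# Hypothetical rows of (room_type, room_rate), for illustration only.
real_rows = [('A', 10), ('B', 20), ('C', 30)]
synthetic_rows = [('A', 10), ('B', 25), ('B', 35), ('A', 12)]

real_types = [row[0] for row in real_rows]
synthetic_types = [row[0] for row in synthetic_rows]
real_rates = [row[1] for row in real_rows]
synthetic_rates = [row[1] for row in synthetic_rows]

print(new_row_fraction(real_rows, synthetic_rows))
print(category_coverage(real_types, synthetic_types))
print(boundary_adherence(real_rates, synthetic_rates))
```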
The diagnostic report contains a summary as well as detailed breakdowns to uncover new insights. You can interact with the object to learn more.
>>> diagnostic_report.get_results()
{
'SUCCESS': [
'Over 90% of the synthetic rows are not copies of the real data',
'The synthetic data covers over 90% of the numerical ranges present in the real data'
],
...
}
>>> diagnostic_report.get_properties()
{
'Synthesis': 1.0,
'Coverage': 0.85,
'Boundaries': 0.90
}
>>> diagnostic_report.get_details(property_name='Coverage', table_name='guests')
Table   Column         Metric         Score
guests  amenities_fee  RangeCoverage  1.00000
guests  checkin_date   RangeCoverage  1.00000
...
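The RangeCoverage metric in the details table asks how much of the real column's numerical range the synthetic column spans. A minimal sketch of one way to compute such a score follows: subtract the uncovered portion at the low end and at the high end from 1, with a floor of 0. The function name and values are illustrative assumptions, and the sketch assumes the real column has at least two distinct values.

```python
def range_coverage(real_values, synthetic_values):
    """Share of the real [min, max] range spanned by the synthetic values."""
    real_min, real_max = min(real_values), max(real_values)
    synthetic_min, synthetic_max = min(synthetic_values), max(synthetic_values)

    span = real_max - real_min
    if span == 0:
        # Assumption of this sketch: a constant real column is out of scope.
        raise ValueError('The real column needs at least two distinct values.')

    # Portion of the real range left uncovered at each end.
    missing_low = max(synthetic_min - real_min, 0) / span
    missing_high = max(real_max - synthetic_max, 0) / span

    return max(1.0 - (missing_low + missing_high), 0.0)


# Hypothetical values, for illustration only.
real_fees = [0, 50, 100]
synthetic_fees = [10, 40, 90]
print(range_coverage(real_fees, synthetic_fees))
```

Synthetic values that fall outside the real range do not raise this score above 1.0; that case is what the boundary check is for.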
Visualize the shapes of your columns in 1D and 2D.
Use this function to visualize a real column against the same synthetic column. You can plot any column of type: boolean, categorical, datetime or numerical.
from sdv.evaluation.multi_table import get_column_plot
fig = get_column_plot(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata,
table_name='guests',
column_name='amenities_fee'
)
fig.show()

Parameters
- (required) real_data: A dictionary mapping each table name to a pandas DataFrame object with the real data
- (required) synthetic_data: A dictionary mapping each table name to a pandas DataFrame object with the synthetic data
- (required) table_name: The name of the table
- (required) column_name: The name of the column in the table you want to plot
- plot_type: The type of plot to create
  - (default) None: Determine an appropriate plot type based on your data type, as specified in the metadata.
  - 'bar': Plot the data as distinct bar graphs
  - 'distplot': Plot the data as a smooth, continuous curve
- sample_size: The number of data points to plot
  - (default) None: Plot all the data points
  - <integer>: Subsample rows from both the real and synthetic data before plotting. Use this if you have a lot of data points.
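The effect of sample_size can be pictured as drawing a random subset of rows before plotting. The sketch below shows that idea in plain Python. It is an illustration under assumptions, not SDV's internal sampling logic; the helper name and the seed parameter are hypothetical.

```python
import random


def subsample(rows, sample_size=None, seed=0):
    """Return all rows, or a random subset of sample_size rows."""
    rows = list(rows)
    if sample_size is None or sample_size >= len(rows):
        # Nothing to reduce: keep every data point.
        return rows
    # A seeded generator keeps the subset reproducible between runs.
    rng = random.Random(seed)
    return rng.sample(rows, sample_size)


# Hypothetical column with many data points, for illustration only.
real_column = list(range(10_000))
synthetic_column = list(range(5, 10_005))

real_subset = subsample(real_column, sample_size=500)
synthetic_subset = subsample(synthetic_column, sample_size=500)
print(len(real_subset), len(synthetic_subset))  # 500 500
```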
Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.
Use this utility to visualize the trends between a pair of columns for real and synthetic data. You can plot any 2 columns of type: boolean, categorical, datetime or numerical. The columns do not have to be the same type.
from sdv.evaluation.multi_table import get_column_pair_plot
fig = get_column_pair_plot(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata,
table_name='guests',
column_names=['room_rate', 'room_type']
)
fig.show()
Parameters
- (required) real_data: A dictionary mapping each table name to a pandas DataFrame object with the real data
- (required) synthetic_data: A dictionary mapping each table name to a pandas DataFrame object with the synthetic data
- (required) table_name: The name of the table
- (required) column_names: A list with the names of the 2 columns you want to plot. Both columns must be in the table you specified.
- plot_type: The type of plot to create
  - (default) None: Determine an appropriate plot type based on your data type, as specified in the metadata.
  - 'box': Create a box plot showing the quartiles, broken down by different attributes
  - 'heatmap': Create a side-by-side heatmap showing the frequency of each pair of values
  - 'scatter': Create a scatter plot that plots each point on a 2D axis
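One plausible way to choose a default when plot_type is None is to look at whether each column is continuous or discrete. The mapping below is an assumption made for illustration (two continuous columns give a scatter plot, two discrete columns give a heatmap, a mixed pair gives a box plot); this page does not spell out the actual rule, so check the SDV source before relying on it. The function name is hypothetical.

```python
CONTINUOUS_TYPES = {'numerical', 'datetime'}
DISCRETE_TYPES = {'categorical', 'boolean'}


def default_pair_plot_type(first_type, second_type):
    """Pick a plot type for a column pair (illustrative assumption)."""
    for column_type in (first_type, second_type):
        if column_type not in CONTINUOUS_TYPES | DISCRETE_TYPES:
            raise ValueError(f'Unsupported column type: {column_type!r}')

    first_continuous = first_type in CONTINUOUS_TYPES
    second_continuous = second_type in CONTINUOUS_TYPES

    if first_continuous and second_continuous:
        return 'scatter'   # every point can be placed on a 2D axis
    if not first_continuous and not second_continuous:
        return 'heatmap'   # frequencies of each pair of values
    return 'box'           # quartiles of one column per value of the other


# Assuming room_rate is numerical and room_type is categorical.
print(default_pair_plot_type('numerical', 'categorical'))  # box
```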
Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.
Use this utility to visualize the cardinality of a multi-table relationship. The cardinality refers to the number of child rows that each parent row has. This could be 0 or more.
from sdv.evaluation.multi_table import get_cardinality_plot
fig = get_cardinality_plot(
real_data=real_data,
synthetic_data=synthetic_data,
child_table_name='guests',
parent_table_name='hotels',
child_foreign_key='hotel_id',
metadata=metadata)
fig.show()
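To see what is being plotted, the cardinality of a relationship can be computed by counting, for each parent row, how many child rows reference it, including parents with 0 children. A minimal pure-Python sketch follows; the function name and the hotel and guest values are made up for illustration.

```python
from collections import Counter


def cardinality_distribution(parent_ids, child_foreign_keys):
    """Map each child count to the number of parents that have that count."""
    children_per_parent = Counter(child_foreign_keys)
    # Look up every parent so that parents without children count as 0.
    counts = [children_per_parent.get(parent_id, 0) for parent_id in parent_ids]
    return Counter(counts)


# Hypothetical primary keys of 'hotels' and 'hotel_id' values of 'guests'.
hotel_ids = ['H1', 'H2', 'H3']
guest_hotel_ids = ['H1', 'H1', 'H2']

distribution = cardinality_distribution(hotel_ids, guest_hotel_ids)
print(sorted(distribution.items()))  # [(0, 1), (1, 1), (2, 1)]
```

Computing this distribution once for the real tables and once for the synthetic tables gives the two sets of counts that a cardinality plot compares.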
Parameters
- (required) real_data: A dictionary mapping each table name to a pandas DataFrame object with the real data
- (required) synthetic_data: A dictionary mapping each table name to a pandas DataFrame object with the synthetic data
- (required) child_table_name: A string describing the name of the child table in the relationship
- (required) parent_table_name: A string describing the name of the parent table in the relationship
- (required) child_foreign_key: A string describing the name of the foreign key column of the child table that references the parent table
Output: A plotly Figure object that plots the cardinality of the real vs. the synthetic data for the provided relationship.
Use fig.show() to see the plot in an IPython notebook. The plot is interactive, allowing you to zoom, scroll and take screenshots.
This library includes many more metrics (some experimental) that you can apply based on your goals. All you need is your real data, synthetic data and metadata to get started.