
Quickstart

Welcome to SDMetrics! In this guide, we'll walk through basic usage with some demo data.
Tip! Follow along with this Quickstart guide using the notebook below.

Demo Data

SDMetrics is model-agnostic, which means that it works with synthetic data created by any model at any time.
To get started, you need:
  1. The real data, represented as a pandas.DataFrame
  2. Your synthetic data, represented as a pandas.DataFrame
  3. Metadata, represented as a dictionary
The command below downloads some demo data and metadata that we can use:
from sdmetrics import load_demo
real_data, synthetic_data, metadata = load_demo(modality='single_table')
Both the real and synthetic data describe different students. They have the same column names representing the same types of data.
real_data.head()
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date
17264 M 67 91 Commerce 58 Sci&Tech FALSE 0 55 Mkt&HR 58.8 27000 TRUE 2020-07-23 2020-10-12
17265 M 79.33 78.33 Science 77.48 Sci&Tech TRUE 1 86.5 Mkt&Fin 66.28 20000 TRUE 2020-01-11 2020-04-09
17266 M 65 68 Arts 64 Comm&Mgmt FALSE 0 75 Mkt&Fin 57.8 25000 TRUE 2020-01-26 2020-07-13
17267 M 56 52 Science 52 Sci&Tech FALSE 0 66 Mkt&HR 59.43 NaN FALSE NaT NaT
17268 M 85.8 73.6 Commerce 73.3 Comm&Mgmt FALSE 0 96.8 Mkt&Fin 55.5 42500 TRUE 2020-07-04 2020-09-27
The metadata is a dictionary that describes the data types of the different columns. This helps SDMetrics understand which metrics to apply to which columns.
See the Single Table Metadata for more details on how to write your own metadata
{
"primary_key": "student_id",
"fields": {
"student_id": {"subtype": "integer", "type": "id"},
"gender": {"type": "categorical"},
"second_perc": {"subtype": "float", "type": "numerical"},
"high_perc": {"subtype": "float", "type": "numerical"},
"high_spec": {"type": "categorical"},
"degree_perc": {"subtype": "float", "type": "numerical"},
"degree_type": {"type": "categorical"},
"work_experience": {"type": "boolean"},
"experience_years": {"subtype": "float", "type": "numerical"},
"employability_perc": {"subtype": "float", "type": "numerical"},
"mba_spec": {"type": "categorical"},
"mba_perc": {"subtype": "float", "type": "numerical"},
"salary": {"subtype": "integer", "type": "numerical"},
"placed": {"type": "boolean"},
"start_date": {"format": "%Y-%m-%d", "type": "datetime"},
"end_date": {"format": "%Y-%m-%d", "type": "datetime"},
"duration": {"type": "categorical"},
}
}
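Since the metadata decides which metric is applied to which column, it can help to confirm that the metadata fields and the data columns line up before generating a report. Here is a minimal sketch using plain Python. The `compare_columns` helper and the toy metadata are hypothetical and not part of SDMetrics; the only thing taken from the guide is the `"fields"` layout shown above.

```python
def compare_columns(metadata, columns):
    """Return (fields missing from the data, columns missing from the metadata)."""
    fields = set(metadata["fields"])
    columns = set(columns)
    return sorted(fields - columns), sorted(columns - fields)


# Toy metadata in the same layout as the demo metadata (values are illustrative).
toy_metadata = {
    "primary_key": "student_id",
    "fields": {
        "student_id": {"subtype": "integer", "type": "id"},
        "gender": {"type": "categorical"},
        "salary": {"subtype": "integer", "type": "numerical"},
    },
}
toy_columns = ["student_id", "gender", "placed"]

not_in_data, not_in_metadata = compare_columns(toy_metadata, toy_columns)
print(not_in_data)      # ['salary']
print(not_in_metadata)  # ['placed']
```

With a pandas.DataFrame, you could pass `real_data.columns` as the second argument.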

Quality Report

Let's get started by creating a quality report.
from sdmetrics.reports.single_table import QualityReport
report = QualityReport()
report.generate(real_data, synthetic_data, metadata)
Creating report: 100%|██████████| 4/4 [00:00<00:00, 7.09it/s]
Overall Quality Score: 82.84%
Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%
It looks like we have a reasonable overall quality score of 82.84%. Let's look at each of the properties in detail.
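The numbers printed above are consistent with the overall score being a simple average of the property scores. The short check below reproduces that arithmetic. It is a sketch based only on the printed output, not a description of the report's internals.

```python
# Property scores copied from the report output above (in percent).
property_scores = {"Column Shapes": 82.78, "Column Pair Trends": 82.9}

# Assumption: the overall score is the unweighted mean of the properties.
overall = sum(property_scores.values()) / len(property_scores)
print(f"Overall: {overall:.2f}%")  # Overall: 82.84%
```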

Column Shapes

This property looks at the overall shape of each column to see if the synthetic data matches the real data. The report details show the quality score for each column, so we can see which columns score better for this property. They also show which metric was used to compute each score (based on the data type).
report.get_details(property_name='Column Shapes')
Column Metric Quality Score
second_perc KSComplement 0.627907
salary KSComplement 0.869155
gender TVComplement 0.939535
...
You can also visualize these scores.
report.get_visualization(property_name='Column Shapes')
The bar graph shows the quality score for each column, color-coded by the metric that was used. This is based on the data type.
In the visualization, we can see which columns have the best and worst quality scores. Let's use the SDMetrics plotting functions to get more insight.
The mba_spec column has the highest quality (0.995). By plotting it, we can see that the real and synthetic data are extremely similar.
from sdmetrics.reports.utils import get_column_plot
fig = get_column_plot(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata,
column_name='mba_spec'
)
fig.show()
A bar plot compares the frequency of categories in real and synthetic data.
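To build some intuition for what a score like this measures on a categorical column, here is a minimal sketch of the idea behind TVComplement: one minus the total variation distance between the category frequencies. The `tv_complement` function and the toy values are illustrative assumptions, not the SDMetrics implementation.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """1 - total variation distance between two categorical samples."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    # Half the sum of absolute differences in relative frequency.
    tvd = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )
    return 1.0 - tvd


# Toy samples (made up): 50/50 split in real data, 60/40 in synthetic data.
real = ["Mkt&HR"] * 5 + ["Mkt&Fin"] * 5
synth = ["Mkt&HR"] * 6 + ["Mkt&Fin"] * 4
score = tv_complement(real, synth)
print(round(score, 3))  # 0.9
```

A score of 1.0 means the category frequencies match exactly, and 0.0 means the two samples share no categories.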
The high_perc column has the lowest quality score (0.553). By plotting it, we can see the differences between the real and synthetic data.
from sdmetrics.reports.utils import get_column_plot
fig = get_column_plot(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata,
column_name='high_perc'
)
fig.show()
A smooth histogram shows the distribution of real and synthetic data for a numerical column.
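For numerical columns, the report details list KSComplement as the metric. As a rough sketch of the idea, the score can be thought of as one minus the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical CDFs. The `ks_complement` function and the toy numbers below are illustrative assumptions, not the library's code.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """1 - the maximum gap between the empirical CDFs of two samples."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)

    def ecdf(sorted_values, x):
        # Fraction of values that are <= x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The maximum gap is always reached at one of the observed values.
    points = set(real_sorted) | set(synth_sorted)
    ks_statistic = max(
        abs(ecdf(real_sorted, x) - ecdf(synth_sorted, x)) for x in points
    )
    return 1.0 - ks_statistic


# Toy samples (made up), not the demo data.
real = [50, 60, 70, 80]
synth = [55, 65, 75, 95]
print(ks_complement(real, synth))  # 0.75
```

Identical samples score 1.0, and samples with no overlap in range score 0.0.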
Next, we can look at the trends between pairs of columns. The report details give us a breakdown for every pair of columns.
report.get_visualization(property_name='Column Pair Trends')
The heatmap at the top shows the overall quality scores. The two heatmaps at the bottom show a side-by-side comparison of the numerical correlations for the real and synthetic data.
We can use the SDMetrics 2D plotting functions to visualize the pairs.
The pair of columns start_date and second_perc have a high score of 0.99. If we plot them, we can see that the synthetic data generally matches the trend.
from sdmetrics.reports.utils import get_column_pair_plot
fig = get_column_pair_plot(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata,
column_names=['start_date', 'second_perc']
)
fig.show()
A scatter plot compares two columns that are numerical or datetime.
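For a pair of numerical columns, one way to score a trend is to compare the correlation coefficient in the real data with the one in the synthetic data. The sketch below assumes a score of 1 - |r_real - r_synth| / 2, which is our understanding of how a correlation-based similarity is usually normalized to the range 0 to 1. The helper names and toy values are hypothetical, and datetime columns would first need to be converted to numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Constant columns (zero variance) are not handled in this sketch.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_similarity(real_pairs, synthetic_pairs):
    """Assumed scoring: 1 - |r_real - r_synth| / 2, so scores fall in [0, 1]."""
    r_real = pearson(*zip(*real_pairs))
    r_synth = pearson(*zip(*synthetic_pairs))
    return 1.0 - abs(r_real - r_synth) / 2


# Toy pairs (made up): rising trend in real data, falling trend in synthetic data.
real_pairs = [(1, 2), (2, 4), (3, 6)]
opposite_pairs = [(1, 6), (2, 4), (3, 2)]
print(correlation_similarity(real_pairs, real_pairs))      # 1.0
print(correlation_similarity(real_pairs, opposite_pairs))  # 0.0
```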
The pair of columns duration and high_perc has a low score of 0.45. If we plot them, we can see that the synthetic data does not match the association between these columns.
from sdmetrics.reports.utils import get_column_pair_plot
fig = get_column_pair_plot(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata,
column_names=['duration', 'high_perc']
)
fig.show()
A box plot compares a numerical column with a categorical column.

Saving and Sharing

You can save the report object to share your insights with your team.
report.save(filepath='sdmetrics_quality_demo.pkl')
# load the report at a later time
report = QualityReport.load(filepath='sdmetrics_quality_demo.pkl')
The report does not save the actual data, but it does save the metadata and scores. These can still reveal some information about the real data, so be careful who you share the report with.

Applying Individual Metrics

To explore the data further, we can manually apply any of the metrics available in the Glossary or Beta sections. Let's go through some examples.

BoundaryAdherence for Software Testing

The BoundaryAdherence metric can be useful for applications like software testing. It tells us whether the synthetic data has respected the min/max boundaries of the real data.
from sdmetrics.single_column import BoundaryAdherence
BoundaryAdherence.compute(
real_data['start_date'],
synthetic_data['start_date']
)
0.8503937007874016
Looking at the score, about 85% of the synthetic values adhere to the boundaries, which means the remaining 15% fall outside the min/max range of the real data. We can verify this by plotting the data.
from sdmetrics.reports.utils import get_column_plot
get_column_plot(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata,
column_name='start_date'
)
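The idea behind this score can be sketched in a few lines: find the min and max of the real column, then count the fraction of synthetic values that fall inside that range. The `boundary_adherence` function below is an illustrative stand-in, not the library's implementation, and it assumes missing values are simply dropped. The real dates come from the table shown earlier; the synthetic dates are made up.

```python
from datetime import date


def boundary_adherence(real_values, synthetic_values):
    """Fraction of synthetic values inside the real column's [min, max] range."""
    # Assumption: missing values (None here) are ignored on both sides.
    real = [v for v in real_values if v is not None]
    synth = [v for v in synthetic_values if v is not None]
    low, high = min(real), max(real)
    inside = sum(1 for v in synth if low <= v <= high)
    return inside / len(synth)


real_dates = [
    date(2020, 7, 23), date(2020, 1, 11), date(2020, 1, 26), None, date(2020, 7, 4),
]
synthetic_dates = [
    date(2020, 1, 5),   # before the real minimum
    date(2020, 2, 1),
    date(2020, 6, 30),
    date(2020, 8, 1),   # after the real maximum
]
print(boundary_adherence(real_dates, synthetic_dates))  # 0.5
```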

Privacy using NewRowSynthesis

We can also check for data privacy. The NewRowSynthesis metric tells us whether the synthetic data contains any exact copies of the real data -- or whether the rows are new.
from sdmetrics.single_table import NewRowSynthesis
NewRowSynthesis.compute(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata
)
1.0
It looks like 100% of the rows in the synthetic data are new -- meaning that there are no exact copies of the real data!
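As a rough illustration of what this score means, the sketch below computes the fraction of synthetic rows that do not appear in the real data. It uses exact matching on every value, which is a simplifying assumption; the actual metric may treat numerical values with a tolerance. The `new_row_synthesis` function and the toy rows are hypothetical.

```python
def new_row_synthesis(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are not exact copies of any real row."""
    real_set = {tuple(row) for row in real_rows}
    new_rows = sum(1 for row in synthetic_rows if tuple(row) not in real_set)
    return new_rows / len(synthetic_rows)


# Toy rows (gender, second_perc, high_spec); the synthetic rows are made up.
real_rows = [("M", 67, "Commerce"), ("M", 79.33, "Science")]
synthetic_rows = [
    ("M", 67, "Commerce"),   # exact copy of a real row
    ("F", 70, "Arts"),
    ("M", 80, "Science"),
    ("F", 61.5, "Commerce"),
]
print(new_row_synthesis(real_rows, synthetic_rows))  # 0.75
```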
Want to explore more metrics? See the Metrics Glossary and Beta sections

Resources

The SDMetrics library is part of the SDV Project, built & maintained by DataCebo.
To connect with the community, join our Slack channel! We have hundreds of users discussing their synthetic data needs.