1.3 Generate and Test with Real Data

In this step, you'll run the same process using the real data (the training data). This forms the other extreme of the baseline: where the fake data represents the lowest expected quality, the real data represents the highest.

1.3.1 Real Data Diagnostic

The real dataset is expected to structurally match your schema. As before, verify this by running the Diagnostic Report.

Running the report should provide a printout of the score and sub-scores. We expect these to be 1.0, indicating that the data structurally matches your schema. If this is not the case, please let us know!
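For reference, below is a minimal sketch of how this report can be generated with the SDMetrics library. It assumes a multi-table setup in which real_data is a dictionary mapping table names to pandas DataFrames and metadata is the corresponding metadata dictionary; because we are diagnosing the real data itself, it is passed into both data arguments.

from sdmetrics.reports.multi_table import DiagnosticReport

# real_data: {table_name: pandas.DataFrame}, assumed already loaded
# metadata: the multi-table metadata dictionary describing your schema
diagnostic = DiagnosticReport()
diagnostic.generate(real_data, real_data, metadata, verbose=True)

# Overall score plus the per-property breakdown shown in the printout below
print(diagnostic.get_score())
print(diagnostic.get_properties())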

Generating report ...

(1/3) Evaluating Data Validity: |██████████| 15/15 [00:00<00:00, 603.69it/s]|
Data Validity Score: 100.0%

(2/3) Evaluating Data Structure: |██████████| 2/2 [00:00<00:00, 151.49it/s]|
Data Structure Score: 100.0%

(3/3) Evaluating Relationship Validity: |██████████| 1/1 [00:00<00:00, 68.51it/s]|
Relationship Validity Score: 100.0%

Overall Score (Average): 100.0%

For more information about the properties captured in the Diagnostic Report, see the SDMetrics API.

1.3.2 Real Data Quality

The real dataset is expected to have the highest possible quality, because it is being compared against itself. Verify this by running the Quality Report.

The quality score should be 1.0 because the data is identical. In some cases, the score may be slightly lower (~0.99) due to subsampling approximations that the report may make.
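The Quality Report can be generated the same way. Here is a minimal sketch, using the same assumed real_data and metadata objects as above, with the real data again passed into both data arguments:

from sdmetrics.reports.multi_table import QualityReport

quality = QualityReport()
quality.generate(real_data, real_data, metadata, verbose=True)

# Overall score, plus a drill-down into one property as an example
print(quality.get_score())
print(quality.get_details('Column Shapes'))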

Generating report ...

(1/4) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 564.15it/s]|
Column Shapes Score: 100.0%

(2/4) Evaluating Column Pair Trends: |██████████| 55/55 [00:00<00:00, 110.40it/s]|
Column Pair Trends Score: 100.0%

(3/4) Evaluating Cardinality: |██████████| 1/1 [00:00<00:00, 53.27it/s]|
Cardinality Score: 100.0%

(4/4) Evaluating Intertable Trends: |██████████| 50/50 [00:00<00:00, 86.54it/s]|
Intertable Trends Score: 100.0%

Overall Score (Average): 100.0%

For more information about the properties captured in the Quality Report, see the SDMetrics API.

1.3.3 End-to-end pipeline simulation

Now, it's time to put the real data to use and measure its effects. This time, load the real data into your full end-to-end pipeline to simulate how data would flow. For example:

  • If your goal is to test a synthetic database, load the data into your database (either using AI Connectors or manually) and test compatibility.

  • For software testing use cases, use the data to validate functional or performance behavior.

  • For analytics use cases, use it in your analytical workflow and assess performance on key metrics.

As you do this, measure whatever metric you have defined for evaluating your overall success criteria.
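What this measurement looks like depends entirely on your use case. Purely as an illustration, the sketch below assumes an analytics-style criterion: a hypothetical downstream classifier is trained on one table from real_data and scored on a hold-out split, with the resulting score recorded as the Success Criteria Score. The table name ('transactions'), target column ('is_fraud'), model, and metric are all placeholders, not part of the SDV workflow itself.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical table and target column; substitute your own
table = real_data['transactions']

# One-hot encode categorical features so the classifier can consume them
X = pd.get_dummies(table.drop(columns=['is_fraud']))
y = table['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Record this value as the Success Criteria Score for the real dataset
success_score = f1_score(y_test, model.predict(X_test))
print(f'Success Criteria Score (real data): {success_score:.4f}')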

1.3.4 Create your results

At this point, you should be able to construct a table with your overall results from both the real and the fake datasets. A hypothetical example is shown below.

                         Real Dataset   Fake Dataset   Synthetic Data

Diagnostic Score         1.0            1.0            <not yet available>
Quality Score            1.0            0.4663         <not yet available>
Success Criteria Score   0.6581         0.5831         <not yet available>

Note that this table contains a final column for the synthetic data, which you will not be able to fill out yet, as we have not yet created or evaluated the synthetic data.
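If it helps to keep these results in a consistent, shareable format, here is a small sketch of recording them as a pandas DataFrame; the values are the hypothetical ones from the example table above.

import pandas as pd

results = pd.DataFrame(
    {
        'Real Dataset': [1.0, 1.0, 0.6581],
        'Fake Dataset': [1.0, 0.4663, 0.5831],
        'Synthetic Data': [None, None, None],  # not yet available
    },
    index=['Diagnostic Score', 'Quality Score', 'Success Criteria Score'],
)
print(results)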

Please share this data with DataCebo as part of our POC.

Are you ready for the next step?

By the end of this step you should have the following:

  • A Diagnostic Report and a Quality Report for the real data, both scoring 1.0

  • A Success Criteria Score for the real data, measured by running it through your end-to-end pipeline

  • A results table covering the real and fake datasets, with the synthetic data column left open for the next step
