1.2 Generate and Test with "Fake" Data
In the previous step, you will have shared your metadata with DataCebo. The DataCebo team will generate a "fake" dataset from that metadata using SDV Enterprise’s DayZSynthesizer, and will share this "fake" data with you.
In this step, you'll start creating some baseline measurements and gain familiarity with SDV's evaluation tools.
1.2.1 Load in the "Fake" Data
The fake dataset will be contained in a folder, with a CSV file corresponding to each table name of your data. For example:
fake_data/
|--- users.csv
|--- transactions.csv
|--- sessions.csv
Download the folder and then read the CSV files into Python.
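One way to do this is with SDV's load_csvs utility, which reads every CSV in a folder into a dictionary of pandas DataFrames keyed by table name. This is a minimal sketch; the folder name 'fake_data/' and the variable name fake_data are illustrative.

from sdv.datasets.local import load_csvs

# Read each CSV in the folder into a pandas DataFrame,
# keyed by its table name (e.g. 'users', 'transactions', 'sessions').
fake_data = load_csvs(folder_name='fake_data/')

print(fake_data.keys())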
1.2.2 Generate a Diagnostic Report for the "Fake" Data
The fake dataset is expected to match your schema structurally, but it will not capture statistical correlations or adhere to any business rules/constraints. You can verify the structural match by running the Diagnostic Report.
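A minimal sketch using SDV's multi-table evaluation module is shown below. It assumes fake_data is the dictionary loaded above, and that real_data and metadata are the real tables and the Metadata object from the previous steps.

from sdv.evaluation.multi_table import run_diagnostic

# Check that the fake data is structurally valid against your metadata.
diagnostic_report = run_diagnostic(
    real_data=real_data,
    synthetic_data=fake_data,
    metadata=metadata
)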
Running the report prints the overall score and sub-scores. We expect all of these to be 100%, indicating that the data structurally matches your schema. If this is not the case, please let us know!
Generating report ...
(1/3) Evaluating Data Validity: |██████████| 15/15 [00:00<00:00, 603.69it/s]|
Data Validity Score: 100.0%
(2/3) Evaluating Data Structure: |██████████| 2/2 [00:00<00:00, 151.49it/s]|
Data Structure Score: 100.0%
(3/3) Evaluating Relationship Validity: |██████████| 1/1 [00:00<00:00, 68.51it/s]|
Relationship Validity Score: 100.0%
Overall Score (Average): 100.0%
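If any score is below 100%, you can drill into the individual checks before reaching out. A minimal sketch, assuming the diagnostic_report object created above:

# Overall score as a float between 0 and 1.
print(diagnostic_report.get_score())

# Per-column details for one of the report's properties.
print(diagnostic_report.get_details(property_name='Data Validity'))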
For more information about the properties captured in the Diagnostic Report, see the SDMetrics API.
1.2.3 Assess the Quality of the "Fake" Data
The fake dataset is not expected to have high quality scores, as it is randomly generated rather than based on any real data patterns. Evaluating its quality provides a useful baseline for the POC. Similar to the Diagnostic Report, run the Quality Report.
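A minimal sketch, again assuming the real_data, fake_data, and metadata variables from above:

from sdv.evaluation.multi_table import evaluate_quality

# Compare the statistical patterns of the fake data against the real data.
quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=fake_data,
    metadata=metadata
)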
The data quality will likely be very low, due to the data being randomly generated.
Generating report ...
(1/4) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 564.15it/s]|
Column Shapes Score: 50.45%
(2/4) Evaluating Column Pair Trends: |██████████| 55/55 [00:00<00:00, 110.40it/s]|
Column Pair Trends Score: 49.29%
(3/4) Evaluating Cardinality: |██████████| 1/1 [00:00<00:00, 53.27it/s]|
Cardinality Score: 45.67%
(4/4) Evaluating Intertable Trends: |██████████| 50/50 [00:00<00:00, 86.54it/s]|
Intertable Trends Score: 41.11%
Overall Score (Average): 46.63%
For more information about the properties captured in the Quality Report, see the SDMetrics API.
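To see where this baseline falls short, you can also compare individual columns visually. A minimal sketch; the table name 'users' and column name 'age' are placeholders for your own schema.

from sdv.evaluation.multi_table import get_column_plot

# Plot the real vs. fake distribution of a single column.
fig = get_column_plot(
    real_data=real_data,
    synthetic_data=fake_data,
    metadata=metadata,
    table_name='users',    # placeholder table name
    column_name='age'      # placeholder column name
)
fig.show()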
1.2.4 Simulate Your End-to-End Pipeline
Now that we've captured some statistical metrics, it's time to put the fake data to use and measure its overall effect. Load this fake data into your full end-to-end pipeline to simulate how data would flow. For example:
If your goal is to test a synthetic database, load the data into your database (either using AI Connectors or manually, as in the sketch below) and test compatibility.
For software testing use cases, use the data to validate functional or performance behavior.
For analytics use cases, use it in your analytical workflow and assess performance on key metrics.
As you do this, measure the metrics you have defined for your overall success criteria.
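As one illustration of the manual database route, the sketch below writes each fake table into a relational database using pandas and SQLAlchemy. The connection string is a placeholder; adapt it (or use AI Connectors instead) for your own environment.

from sqlalchemy import create_engine

# Placeholder connection string -- replace with your own database.
engine = create_engine('postgresql://user:password@localhost:5432/poc_db')

# Write each fake table so the downstream pipeline can be exercised.
for table_name, table_df in fake_data.items():
    table_df.to_sql(table_name, engine, if_exists='replace', index=False)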
Are you ready for the next step?
By the end of this step you should have the following:
The "fake" data loaded into Python
A Diagnostic Report confirming that the fake data structurally matches your schema
A baseline Quality Report score for the fake data
Baseline measurements from running the fake data through your end-to-end pipeline