1.2 Generate and Test with "Fake" Data

In the previous step, you shared your metadata with DataCebo. The DataCebo team will generate a "fake" dataset from your metadata using SDV Enterprise’s DayZSynthesizer, and will share this "fake" data with you.

What is "fake" data and why is this step necessary?

"Fake" date is randomly generated data. It does not have any statistical similarity to the real data as it generated without looking at the real data and simply using your metadata. For example, for a numerical column that has a range of 1 -100, it will create numbers uniformly distributed in that range.

This step allows you to run your project end-to-end with fake data: you will validate the pipeline you set up for your synthetic data project before executing it during the POC, and you will create baseline numbers for your success criteria metrics.

Time and again, this step has proven to be the most critical factor in achieving a successful POC.

In this step, you'll start creating some baseline measurements and gain familiarity with SDV's evaluation tools.

1.2.1 Load in the "Fake" Data

The fake dataset will be contained in a folder, with a CSV file corresponding to each table name of your data. For example:

fake_data/
|--- users.csv
|--- transactions.csv
|--- sessions.csv

Download the folder and then read the CSV files into Python.
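A minimal sketch using SDV's load_csvs utility, assuming the folder is saved locally as fake_data/:

from sdv.datasets.local import load_csvs

# Read every CSV in the folder into a dictionary that maps
# each table name (e.g. 'users') to a pandas DataFrame
fake_data = load_csvs(folder_name='fake_data/')

print(fake_data.keys())  # dict_keys(['users', 'transactions', 'sessions'])
print(fake_data['users'].head())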

1.2.2 Generate a Diagnostic Report for the "Fake" Data

The fake dataset is expected to structurally match your schema, but it will lack statistical correlations, and it will not adhere to any business rules or constraints. We can verify that it structurally matches your schema by running the Diagnostic Report.
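A minimal sketch of running it with SDV, assuming your real data lives in a real_data/ folder and your metadata is saved as metadata.json (both names are placeholders; older SDV versions use MultiTableMetadata in place of Metadata):

from sdv.datasets.local import load_csvs
from sdv.metadata import Metadata
from sdv.evaluation.multi_table import run_diagnostic

# Load the real data, the fake data, and the metadata you shared earlier
real_data = load_csvs(folder_name='real_data/')
fake_data = load_csvs(folder_name='fake_data/')
metadata = Metadata.load_from_json(filepath='metadata.json')

# Check that the fake data structurally matches your schema
diagnostic_report = run_diagnostic(
    real_data=real_data,
    synthetic_data=fake_data,
    metadata=metadata
)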

Running the report should provide a printout of the score and sub-scores. We expect these to be 1.0, indicating that the data structurally matches your schema. If this is not the case, please let us know!

Generating report ...

(1/3) Evaluating Data Validity: |██████████| 15/15 [00:00<00:00, 603.69it/s]|
Data Validity Score: 100.0%

(2/3) Evaluating Data Structure: |██████████| 2/2 [00:00<00:00, 151.49it/s]|
Data Structure Score: 100.0%

(3/3) Evaluating Relationship Validity: |██████████| 1/1 [00:00<00:00, 68.51it/s]|
Relationship Validity Score: 100.0%

Overall Score (Average): 100.0%

For more information about the properties captured in the Diagnostic Report, see the SDMetrics API.
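The returned report object can also be queried programmatically, for example:

# Overall score, plus a per-column breakdown of a single property
diagnostic_report.get_score()
diagnostic_report.get_details(property_name='Data Validity')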

1.2.3 Assess the quality of the "Fake" Data

The fake dataset is not expected to have high quality scores, as the data is randomly generated and not based on any actual data patterns. Evaluating the quality of this data will provide a useful baseline for the POC. Similar to the Diagnostic Report, run the Quality Report.
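Continuing the sketch above, with the same data and metadata objects:

from sdv.evaluation.multi_table import evaluate_quality

# Measure statistical similarity between the real and fake data.
# For randomly generated data, expect a low score: this is your baseline.
quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=fake_data,
    metadata=metadata
)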

The data quality will likely be very low, due to the data being randomly generated.

Generating report ...

(1/4) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 564.15it/s]|
Column Shapes Score: 50.45%

(2/4) Evaluating Column Pair Trends: |██████████| 55/55 [00:00<00:00, 110.40it/s]|
Column Pair Trends Score: 49.29%

(3/4) Evaluating Cardinality: |██████████| 1/1 [00:00<00:00, 53.27it/s]|
Cardinality Score: 45.67%

(4/4) Evaluating Intertable Trends: |██████████| 50/50 [00:00<00:00, 86.54it/s]|
Intertable Trends Score: 41.11%

Overall Score (Average): 46.63%

For more information about the properties captured in the Quality Report, see the SDMetrics API.
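As with the Diagnostic Report, you can inspect individual properties, and the Quality Report can also produce visualizations (assuming a plotly-capable environment):

# Per-column breakdown and a visualization of a single property
quality_report.get_details(property_name='Column Shapes')
fig = quality_report.get_visualization(property_name='Column Shapes')
fig.show()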

1.2.4 Simulate your end-to-end pipeline

Now that we've captured some statistical metrics, it's time to put the fake data to use and measure its overall effect. Load this fake data into your full end-to-end pipeline to simulate how data would flow. For example:

  • If your goal is to test a synthetic database, load the data into your database (either using AI Connectors or manually) and test compatibility (see the sketch after this list).

  • For software testing use cases, use the data to validate functional or performance behavior.

  • For analytics use cases, use it in your analytical workflow and assess performance on key metrics.

As you do this, measure whatever metric you have for evaluating the overall success criteria.
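As an illustration of the database case above, here is a hypothetical sketch that writes each fake table into a local SQLite database using pandas (the connection string is a placeholder; substitute your own database and loading mechanism):

import sqlalchemy

# Hypothetical target database; replace with your own connection string
engine = sqlalchemy.create_engine('sqlite:///fake_poc.db')

# Write each fake table to the database to confirm the data loads cleanly
for table_name, df in fake_data.items():
    df.to_sql(table_name, con=engine, if_exists='replace', index=False)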

Are you ready for the next step?

By the end of this step you should have the following:

  • The "fake" dataset downloaded and loaded into Python

  • A Diagnostic Report for the fake data, with an overall score of 1.0

  • A Quality Report for the fake data, providing baseline quality scores

  • Baseline numbers for your success criteria metrics, measured by running the fake data through your end-to-end pipeline
