Generating Synthetic Data Using a Deep Learning Model

Omkar Gawade
3 min read · Dec 13, 2022


Image by Mahdis Mousavi on Unsplash.com

There's no denying that the biggest resource of the 21st century is data. Data is often described as the driver of the fourth industrial revolution, and the companies that collect data and use it to make informed decisions are the ones that are thriving. But collecting data and running experiments is not as elementary as it sounds: most mid-level companies do not have the infrastructure or the resources to store data at scale. In most scenarios where you have to test a machine learning / AI model without enough data, you have to generate synthetic data using machine learning. In some use cases, synthetic data is generated because the company wants to keep its real data confidential.

Synthetic Data generation using Synthetic Data Vault (SDV):

Synthetic Data Vault (SDV) is a synthetic data generation ecosystem that lets a user create synthetic data by learning from single-table, multi-table, text and time series datasets. SDV uses probabilistic graphical models and deep learning techniques to generate synthetic data. You can read more about the library here.

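In our scenario we will use a Gaussian Copula to generate synthetic data. A Gaussian copula is a method for generating multivariate random variables: the dependence between columns is captured by the covariance (correlation) matrix of a multivariate normal distribution, while each column keeps its own marginal distribution.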
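To make the idea concrete, here is a minimal NumPy sketch of a Gaussian copula. It is an illustration under simplifying assumptions (numeric columns only, empirical marginals, no ties in the data), not SDV's internal implementation, and the function names are made up for this example.

```python
from statistics import NormalDist

import numpy as np

# Standard normal CDF and inverse CDF, vectorized over NumPy arrays.
# (scipy.stats.norm is the usual choice; the stdlib keeps this sketch light.)
_std_normal = NormalDist()
norm_cdf = np.vectorize(_std_normal.cdf)
norm_ppf = np.vectorize(_std_normal.inv_cdf)


def fit_copula_correlation(data):
    """Estimate the copula's correlation matrix from an (n_rows, n_cols) array."""
    n_rows = data.shape[0]
    # Rank each column (1..n) and rescale the ranks into the open interval (0, 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    uniforms = ranks / (n_rows + 1)
    # Map the uniforms to standard normal scores and measure their correlation.
    return np.corrcoef(norm_ppf(uniforms), rowvar=False)


def sample_copula(data, correlation, n_samples, seed=0):
    """Draw synthetic rows that reuse each column's empirical distribution."""
    rng = np.random.default_rng(seed)
    n_cols = data.shape[1]
    # 1. Correlated standard normal draws.
    normals = rng.multivariate_normal(np.zeros(n_cols), correlation, size=n_samples)
    # 2. Convert them back to uniforms in [0, 1].
    uniforms = norm_cdf(normals)
    # 3. Push each uniform column through that column's empirical quantile function.
    columns = [np.quantile(data[:, j], uniforms[:, j]) for j in range(n_cols)]
    return np.column_stack(columns)


# Toy "real" data: two dependent columns, the second one skewed.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = np.exp(0.8 * x + 0.6 * rng.normal(size=500))
real = np.column_stack([x, y])

correlation = fit_copula_correlation(real)
synthetic = sample_copula(real, correlation, n_samples=2000)
print(synthetic.shape)
```

The synthetic rows keep the skewed shape of the second column because step 3 reuses the observed quantiles, while the correlation matrix carries the dependence between the two columns.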

Below is the code to install SDV and import the Gaussian Copula model:

Image by author

To fit the Gaussian Copula model, let's import a generic dataset into the notebook: an insurance claims dataset downloaded from Kaggle. This dataset has categorical, continuous and free-text data, so we can see the range of the SDV library.

Image by author

From the image above we can see that the insurance claims data has object, int64 and float64 data types.

Fitting the Gaussian Copula model to the dataset and generating 100k samples:

Image by author
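Since the original code is only available as screenshots, here is a rough sketch of what the install, import, fit and sample steps might look like. It assumes a pre-1.0 SDV release, where the model lives at `sdv.tabular.GaussianCopula`; in SDV 1.0 and later the equivalent class is `GaussianCopulaSynthesizer` in `sdv.single_table`, which also expects a metadata object. The helper name and the CSV file name are placeholders, not the ones used in the screenshots.

```python
def fit_and_sample(real_data, num_rows):
    """Fit a Gaussian Copula model to a pandas DataFrame and sample synthetic rows."""
    # Imported inside the function so the helper can be defined before
    # SDV is installed (assumed install command: pip install "sdv<1.0").
    from sdv.tabular import GaussianCopula

    model = GaussianCopula()
    model.fit(real_data)  # learn the column distributions and their correlations
    return model.sample(num_rows=num_rows)  # returns a new DataFrame


# Hypothetical usage; the file name below is a placeholder:
# import pandas as pd
# real_data = pd.read_csv("insurance_claims.csv")
# synthetic_data = fit_and_sample(real_data, num_rows=100_000)
```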

Let's check the data types of the generated synthetic variables:

Image by author

Just eyeballing the data types of the original and synthetic data, they match exactly, which shows that SDV preserves the schema of the original table.
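Eyeballing works for a handful of columns, but with dozens of variables a programmatic check is less error-prone. The sketch below assumes both tables are pandas DataFrames with the same column names; the helper name and the tiny stand-in frames are made up for illustration and are not the insurance dataset.

```python
import pandas as pd


def dtype_mismatches(real, synthetic):
    """Map each column whose dtype differs to its (real, synthetic) dtypes."""
    return {
        col: (real[col].dtype, synthetic[col].dtype)
        for col in real.columns
        if real[col].dtype != synthetic[col].dtype
    }


# Tiny stand-in frames with made-up values.
real = pd.DataFrame({"age": [41, 29], "claim_amount": [1200.5, 830.0]})
synthetic = pd.DataFrame({"age": [35, 52], "claim_amount": [990.25, 1410.0]})

# An empty dict means every column has the same dtype in both tables.
print(dtype_mismatches(real, synthetic))
```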

Let's go a step further and use the table_evaluator library to compare the real and synthetic datasets:

Image by author
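A comparison along these lines could be set up as in the sketch below. It assumes the `table_evaluator` package is installed and that both tables are pandas DataFrames with identical columns; treating every object-typed column as categorical is an assumption of this sketch, and the helper name is hypothetical.

```python
def compare_tables(real, synthetic):
    """Plot real-versus-synthetic comparisons with table_evaluator."""
    # Imported inside the function so the helper can be defined before
    # the package is installed (pip install table_evaluator).
    from table_evaluator import TableEvaluator

    # Assumption: every object-typed (text) column is treated as categorical.
    cat_cols = real.select_dtypes(include="object").columns.tolist()
    evaluator = TableEvaluator(real, synthetic, cat_cols=cat_cols)
    evaluator.visual_evaluation()  # draws the comparison plots
    return evaluator
```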

Log Mean & Standard deviation comparison between real and fake data

Image by author
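The idea behind a log mean and standard deviation comparison can be reproduced by hand. The sketch below is a simplified stand-in, not the plotting code that table_evaluator runs: it computes the log of the absolute mean and of the standard deviation for each numeric column in both tables, so that matching columns show nearly equal values. The frames hold made-up numbers.

```python
import numpy as np
import pandas as pd


def log_mean_std(real, synthetic):
    """Log of |mean| and of std for each numeric column, real vs. synthetic."""
    numeric = real.select_dtypes(include="number").columns
    return pd.DataFrame({
        "real_log_mean": np.log(real[numeric].mean().abs()),
        "synthetic_log_mean": np.log(synthetic[numeric].mean().abs()),
        "real_log_std": np.log(real[numeric].std()),
        "synthetic_log_std": np.log(synthetic[numeric].std()),
    })


# Stand-in frames with made-up values; the text column is ignored.
real = pd.DataFrame({"age": [20, 30, 40], "cost": [100.0, 200.0, 300.0], "city": ["a", "b", "c"]})
synthetic = pd.DataFrame({"age": [22, 31, 39], "cost": [110.0, 190.0, 310.0], "city": ["b", "a", "c"]})

summary = log_mean_std(real, synthetic)
print(summary)
```

The log scale keeps columns with very different magnitudes, such as ages and claim amounts, readable side by side.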

Visuals for variable comparison between real and fake data:

I have included visuals for a few selected variables; the full report for all 44 variables would be too long to include here.

Image by author
Image by author

From the charts above we can see how powerful SDV is: it generates both categorical and normalized continuous variables.

Let's get coding, y'all: generate synthetic data and use it to train ML models and more!

Follow my Medium blog and add me on LinkedIn. I would love to connect with y'all.

LinkedIn: Omkar-gawade

Omkar Gawade

Sr. Data Scientist @ Sourceday | Data professional with 5 plus years experience in analytics, visualizations and modeling.