Fake it till you make it: an introduction to synthetic data
Conference (BEGINNER level)
Room 4

Using ‘real’ data may be tempting, yet under the GDPR it’s not a good idea when dealing with personal information. Unfortunately, testing or debugging software may be harder without full access to all underlying data. A synthetic dataset can be a good solution: fictitious replacement data that mimics the structure and distribution of the original data.

Joachim Ganseman from Smals Research talks about how synthetic data can be generated, and especially about the practical concerns and limitations.

  • How do we deal with rarely occurring values, correlations or dependencies?
  • What about the balance between maximum privacy protection vs. retaining enough functional usability?
  • Can we do reliable analytics on a synthetic dataset?

He will share some practical examples using open source software in Python.


Joachim Ganseman

Joachim Ganseman obtained a master's in Computer Science at the University of Antwerp. During a subsequent Ph.D. project he focused on digital signal processing, machine learning and audio analysis, branching out to Queen Mary University of London and Stanford University. After dropping out and a few other stints, he joined Smals in 2018 as a member of the Research team, where he focuses on AI-related topics, Machine Learning and Natural Language Processing, in an attempt to unlock their potential for the public sector. His interests include data science, AI, CS education, open source software, digital humanities, digital audio processing, piano and organ music, and everything geeky.

Generated Summary
WARNING: This summary was generated using GPT based on the transcript. As a result, spelling mistakes and, more importantly, hallucinations can be present.

Synthetic Data: An Overview
Smals and the Government
Smals is a large IT provider in Belgium that works exclusively for the government and its agencies. Joachim Ganseman, part of the research division, talks about synthetic data as a way to make the government more efficient.
What is Synthetic Data?
Synthetic data mimics the distributions and dependencies of a real dataset, such as those found on Kaggle. There are strict regulations in place for personal data, and more regulations are in development. Synthetic data can stand in for real personal data where these regulations restrict its use, and it can also save money on collecting real data.
Uses for Synthetic Data
Synthetic data can be used in many scenarios, such as testing the integration of a full e-health system. It can also be used for machine learning applications, simulations, and testing.
Creating Synthetic Data
Synthetic data is created using random number generators, templates, and libraries such as Faker and Mimesis. These libraries are extensible with custom generation routines, allowing for the generation of data that mimics real-world data.
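As a rough illustration of the random-number-and-template approach, here is a minimal sketch that uses only the Python standard library. The value pools, the template syntax (`#` for a digit, `?` for a letter) and the function names `fill_template`, `fake_person` and `generate_people` are assumptions made for this example. They are not the API of any particular library and are not taken from the talk.

```python
import random
import string

# Hypothetical value pools, invented for this sketch.
FIRST_NAMES = ["Anna", "Bram", "Charlotte", "Daan", "Emma"]
LAST_NAMES = ["Peeters", "Janssens", "Maes", "Jacobs", "Mertens"]
CITIES = ["Antwerpen", "Gent", "Leuven", "Namur", "Brugge"]


def fill_template(template: str, rng: random.Random) -> str:
    """Replace '#' with a random digit and '?' with a random uppercase letter."""
    out = []
    for ch in template:
        if ch == "#":
            out.append(rng.choice(string.digits))
        elif ch == "?":
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)
    return "".join(out)


def fake_person(rng: random.Random) -> dict:
    """Build one fictitious record from the pools and templates above."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "city": rng.choice(CITIES),
        "age": rng.randint(18, 90),
        "phone": fill_template("+32 4## ## ## ##", rng),
        "customer_id": fill_template("??-####", rng),
    }


def generate_people(n: int, seed: int = 42) -> list:
    """Generate n records; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [fake_person(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in generate_people(3):
        print(row)
```

Every field here is drawn independently, so the records look plausible one by one but carry none of the distributions or correlations of a real dataset.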
Generation of Synthetic Data
The talk discusses a few ways of generating synthetic data, such as preserving distributions and correlations, running simulations, and using deep learning algorithms. It gives an example of using the Synthetic Data Vault (SDV) Python library to generate synthetic data from the Adult Census Income dataset. It also includes a brief overview of the dataset and its descriptive statistics.
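To make "preserving distributions and correlations" concrete, the sketch below fits the means, standard deviations and Pearson correlation of two numeric columns, then samples new pairs from a bivariate normal distribution with the same parameters. It uses only the standard library and is not the Synthetic Data Vault API. The "original" data is a made-up toy stand-in, not the Adult Census Income dataset, and all function names are assumptions for this example.

```python
import math
import random


def make_toy_original(n, seed=0):
    """Toy stand-in for real data: (age, hours) pairs with a built-in dependency."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(18, 70)
        hours = 20 + 0.4 * age + rng.gauss(0, 5)
        rows.append((age, hours))
    return rows


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations and Pearson correlation of (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "std_x": std_x, "std_y": std_y,
        "rho": cov / (std_x * std_y),
    }


def sample_bivariate_normal(params, n, seed=0):
    """Draw n synthetic (x, y) pairs that share the fitted means, spreads and correlation."""
    rng = random.Random(seed)
    rho = params["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        x = params["mean_x"] + params["std_x"] * z1
        # Mixing z1 into y induces correlation rho between x and y.
        y = params["mean_y"] + params["std_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


if __name__ == "__main__":
    original = make_toy_original(2000, seed=1)
    params = fit_bivariate_normal(original)
    synthetic = sample_bivariate_normal(params, 2000, seed=2)
    print("original :", params)
    print("synthetic:", fit_bivariate_normal(synthetic))
```

The normal assumption is the weak point of this sketch: the toy ages are uniform between 18 and 70, while the synthetic ages follow a bell curve and can fall outside that range. Dedicated libraries model the shape of each column as well as the dependencies between them.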
In short, synthetic data can stand in for real personal data where regulations restrict its use, and it can save money on collecting real data. It can be used in many scenarios, such as machine learning applications and simulations, and can be generated using random number generators, templates, and libraries such as Faker and Mimesis.