Using ‘real’ data may be tempting, but under the GDPR it is rarely a good idea when personal information is involved. Unfortunately, testing or debugging software becomes harder without full access to the underlying data. A synthetic dataset can be a good solution: fictitious replacement data that mimics the structure and distribution of the original data.
Joachim Ganseman from Smals Research talks about how synthetic data can be generated, and in particular about the practical concerns and limitations:
- How do we deal with rarely occurring values, correlations or dependencies?
- What about the balance between maximum privacy protection and retaining enough functional usability?
- Can we do reliable analytics on a synthetic dataset?
He will share some practical examples using open source software in Python.
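As a flavour of what such examples might look like (this is an illustrative sketch, not the speaker's actual code): a naive approach samples each column independently, drawing numeric columns from a fitted normal distribution and categorical columns by their empirical frequencies. The sketch uses only the Python standard library and deliberately ignores cross-column correlations, one of the limitations the talk addresses; real-world work would typically use a dedicated open-source library.

```python
import random
import statistics
from collections import Counter

# Toy "original" dataset: personal records we must not expose directly.
original = [
    {"age": 34, "city": "Antwerp"},
    {"age": 41, "city": "Brussels"},
    {"age": 29, "city": "Antwerp"},
    {"age": 53, "city": "Ghent"},
    {"age": 38, "city": "Brussels"},
]

def synthesize(records, n, seed=0):
    """Generate n fictitious records that mimic per-column distributions.

    Numeric columns are drawn from a normal fitted to the originals;
    categorical columns are drawn by empirical frequency. Correlations
    and dependencies between columns are NOT preserved.
    """
    rng = random.Random(seed)
    ages = [r["age"] for r in records]
    mu, sigma = statistics.mean(ages), statistics.stdev(ages)
    city_counts = Counter(r["city"] for r in records)
    values, weights = zip(*city_counts.items())
    return [
        {"age": max(0, round(rng.gauss(mu, sigma))),
         "city": rng.choices(values, weights=weights)[0]}
        for _ in range(n)
    ]

synthetic = synthesize(original, 100)
```

Even this toy version raises the questions from the talk: a rare value (a city appearing once) may leak information or vanish entirely, and any analysis relying on the age–city relationship will be misleading on the synthetic set.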
Joachim Ganseman obtained a master's degree in Computer Science at the University of Antwerp. During a subsequent Ph.D. project he focused on digital signal processing, machine learning and audio analysis, branching out to Queen Mary University of London and Stanford University. After dropping out and a few other stints, he has been working at Smals since 2018 as a member of the Research team, where he focuses on AI-related topics, Machine Learning and Natural Language Processing, in an attempt to unlock their potential for the public sector. His interests include data science, AI, CS education, open source software, digital humanities, digital audio processing, piano and organ music, and everything geeky.