Synthetic Data

An AI project is built on vast amounts of data. Good-quality data can be hard or expensive to gather, and there are serious privacy concerns when the data pertains to real persons. At the European level, the GDPR imposes high standards and restrictions on data gathering, management and usage.

This gives the consumer strong protection, but it does not make the data scientist's work any easier. As a result, the concept of “synthetic data” is gaining traction: fictitious data that simulates the statistical properties of the original dataset. Applications include dataset rebalancing, masking or anonymizing sensitive data, and building simulation environments for machine learning applications.
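To make "simulating the statistical properties of the original dataset" a bit more concrete, here is a minimal sketch in Python. It is not the method from the linked article, just one simple illustration: it assumes the data is purely numeric and roughly Gaussian, estimates a mean vector and covariance matrix from it, and then samples brand-new rows from that fitted distribution. The "real" dataset, the column meanings (age, income) and the helper names `fit_gaussian` and `sample_synthetic` are all made up for this example.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the real data."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)  # columns are variables, rows are records
    return mean, cov


def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw fictitious rows from a multivariate normal with the fitted parameters."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Stand-in for a sensitive "real" dataset: two correlated columns.
# These numbers are invented purely for illustration.
rng = np.random.default_rng(42)
age = rng.normal(45, 12, size=5000)
income = 1000 + 40 * age + rng.normal(0, 300, size=5000)
real = np.column_stack([age, income])

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# No synthetic row is a copy of a real record, yet the overall
# statistics (means, correlation between columns) stay close.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
print("real corr:      ", real_corr)
print("synthetic corr: ", synth_corr)
```

A plain Gaussian fit like this only preserves means and linear correlations, and on its own it offers no formal privacy guarantee. Practical synthetic data tools rely on richer generative models, but the basic idea is the same: learn the distribution, then sample from it.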

Reason enough to take a closer look! In this article for the Smals Research blog, I dive deeper into the what and why of synthetic data. (Machine-translated version here.)