Accelerating AI Development With Synthetic Data
DRAWBACKS
Although the utility of synthetic data is almost beyond dispute, this application of AI still has several shortcomings. Here are some of the most common:
- Reversing the Process: When rules-based methods create a synthetic dataset, it may be possible to reverse the process and determine, for example, how an individual patient’s data affected analytics on the synthetic dataset. Reversing the process is much more arduous when synthetic data is created via machine learning. “For the privacy-preserving use case where the goal is to have fully anonymous synthetic data, one feature is it’s fully anonymous,” Hann explained. “But you can’t, for example, send an email campaign to ‘customers’ created by synthetic data because they are not real customers. So, when you need to go back to the human original customers, that’s a limitation.”
- Overfitting: Overfitting, in which a model learns the features of its training dataset too closely and fails to generalize properly in production, is a caveat for machine learning in general and a potential hazard for synthetic data generation in particular (a simple memorization check is sketched in the first example after this list).
- The Best Technique: As Hann observed, “All synthetic data isn’t created equal. There [are] different approaches, levels of quality, and fidelity you get with different solutions.” Finding the right one isn’t always easy (the second example after this list shows one basic way to compare fidelity).
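
To make the overfitting risk concrete, here is a minimal sketch of one common way to flag memorization in a synthetic-data generator: compare how close synthetic rows sit to the real training rows versus how close genuinely unseen (holdout) rows sit to them. The libraries used (NumPy, scikit-learn), the memorization_gap helper, and the toy data are illustrative assumptions, not part of any vendor’s product.

```python
# Sketch: flagging possible overfitting (memorization) in a synthetic-data
# generator via a distance-to-closest-record check. Data and thresholds
# are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def memorization_gap(train, holdout, synthetic):
    """Median nearest-neighbor distance to the training data for
    synthetic rows vs. holdout rows (all numeric 2-D arrays with
    the same columns)."""
    scaler = StandardScaler().fit(train)
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(train))

    synth_d, _ = nn.kneighbors(scaler.transform(synthetic))
    hold_d, _ = nn.kneighbors(scaler.transform(holdout))
    return float(np.median(synth_d)), float(np.median(hold_d))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 5))
    holdout = rng.normal(size=(200, 5))
    # A "bad" generator that copies training rows with tiny noise:
    synthetic = train[:200] + rng.normal(scale=0.01, size=(200, 5))

    synth_med, hold_med = memorization_gap(train, holdout, synthetic)
    print(f"median NN distance, synthetic -> train: {synth_med:.3f}")
    print(f"median NN distance, holdout   -> train: {hold_med:.3f}")
    # If synthetic rows sit much closer to training rows than holdout
    # rows do, the generator has likely memorized individual records.
```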
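
Similarly, when weighing the quality and fidelity that different solutions deliver, a rough first pass is to compare each column’s distribution in the synthetic data against the real data. The sketch below, which assumes numeric columns and uses per-column Kolmogorov-Smirnov statistics from SciPy, is an illustrative example rather than a definitive fidelity metric; the fidelity_score helper and the “vendor” datasets are hypothetical.

```python
# Sketch: comparing the statistical fidelity of two synthetic datasets
# against the real data via per-column KS statistics (lower = closer).
import numpy as np
from scipy.stats import ks_2samp


def fidelity_score(real, synthetic):
    """Mean KS statistic across numeric columns: 0 means identical
    marginal distributions, values near 1 mean very different ones."""
    stats = [ks_2samp(real[:, j], synthetic[:, j]).statistic
             for j in range(real.shape[1])]
    return float(np.mean(stats))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    real = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))
    vendor_a = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))  # faithful
    vendor_b = rng.normal(loc=0.5, scale=2.0, size=(5000, 4))  # distorted
    print("vendor A score (lower is better):", round(fidelity_score(real, vendor_a), 3))
    print("vendor B score (lower is better):", round(fidelity_score(real, vendor_b), 3))
```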
However, for prudent organizations that know how to mitigate these drawbacks with an approach suited to their use case, the advantages of synthetic data are manifold.