
Accelerating AI Development With Synthetic Data

Synthetic data applications are at the intersection of some of the most meaningful developments in enterprise AI. Synthetic data techniques represent some of the earliest manifestations of generative AI and predate the widespread adoption of language models. In fact, employing synthetic data is one of the foremost methods of building—and fine-tuning—language models and foundation models in general.

Synthetic data applications are just as useful for testing AI models as they are for developing them. With the proper implementation, organizations can harness synthetic data to detect, then mitigate, model bias while overcoming shortages in training or fine-tuning data—particularly for domain-specific data, which is often scarce.

A synthetic dataset is artificially created data, devised to be statistically similar to an existing dataset. Since it closely resembles, yet doesn’t duplicate, that original dataset, identifiers such as specific names or addresses aren’t in the synthetic dataset. Because it’s statistically similar to the original dataset, it provides an ideal form for analyzing or training models with that data—without compromising data privacy, data security, or regulatory compliance.
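The core idea can be sketched in a few lines. The example below (with hypothetical column names) fits simple per-column statistics on an "original" dataset and then samples fresh rows that match those statistics without copying any record. Production tools model joint distributions and correlations between columns; this independent-marginals sketch only illustrates the principle of "statistically similar, but not a duplicate."

```python
import random
import statistics

# Hypothetical original records containing values we don't want to expose.
original = [
    {"age": 34, "income": 62000},
    {"age": 51, "income": 88000},
    {"age": 29, "income": 45000},
    {"age": 42, "income": 71000},
]

def fit_columns(rows):
    """Estimate mean and standard deviation for each numeric column."""
    stats = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        stats[col] = (statistics.mean(values), statistics.stdev(values))
    return stats

def sample_synthetic(stats, n, seed=0):
    """Draw brand-new rows from the fitted per-column distributions."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in stats.items()}
        for _ in range(n)
    ]

stats = fit_columns(original)
synthetic = sample_synthetic(stats, n=100)
```

Because each synthetic row is sampled from fitted distributions rather than copied, no row traces back to a real record, yet aggregate statistics remain close to the original.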

“It’s all about enriching and enhancing data for your AI development efforts, your analytics, and anything you’re using to feed into your decision-making process and, of course, troubleshoot,” commented Brett Wujek, principal data scientist with the Artificial Intelligence and Machine Learning division at SAS R&D. “It’s all about a foundation for AI.”

The synthetic data concept is noteworthy because it illustrates the full breadth of AI which, contrary to popular belief, does not rely solely on the statistical methods and deep neural networks that many associate with the technology.

There’s a variety of approaches for producing synthetic data, which not only include probabilistic machine learning but also non-statistical, rules-based approaches. There are even rules-based approaches (that are typically non-statistical) which are fundamentally statistical in nature, proving that there’s more than just two types of AI.

PRIVACY AND LEGAL RAMIFICATIONS

Synthetic data enables a wide gamut of use cases. In addition to the data science necessities of training and fine-tuning models, this data supports simulation scenarios (ranging from environmental modeling to computer chip manufacturing). The most widely adopted application likely pertains to privacy and legal matters, because of the absence of sensitive information in synthetic data.

In highly regulated verticals such as healthcare, synthetic datasets are critical for performing analytics on data that would otherwise contain personally identifiable information (PII). Franz CEO Jans Aasman points to MITRE, a not-for-profit company, which “looked at all the people in Massachusetts and analyzed all the data and regularities in the data, then created a model. Now, they use that model to generate patient data that is almost like normal patient data, except it can’t lead you back to a person because it was generated by an AI model.”

Synthetic data delivers similar advantages for working with proprietary data, particularly for language model deployments of retrieval-augmented generation, question-answering, and summarization. Organizations can render their data anonymous to fine-tune models or supplement prompts with synthetic data. According to Mostly AI CEO Tobias Hann, “For question-and-answer pairs, prompt response pairs, you have those in a proprietary, original data format. You create a synthetic version of that, [which] is still representative and contains the relevant information, but is fully anonymous, doesn’t contain any PII, and then you use that in different downstream tasks.”
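To make the anonymization goal concrete, here is a minimal sketch (not Mostly AI's actual method, and the question-answer pair is invented): a regex pass that scrubs obvious PII patterns from prompt-response pairs before they are used downstream. Real synthetic-data tools go further, learning the data's structure and generating entirely new pairs; this only illustrates the "fully anonymous, no PII" requirement.

```python
import re

# Simplified patterns for illustration; real pipelines typically use
# named-entity recognition plus generative replacement, not just regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scrub(text):
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

pair = {
    "question": "How do I reach John at john.doe@example.com?",
    "answer": "Call 555-123-4567 or email john.doe@example.com.",
}
clean = {k: scrub(v) for k, v in pair.items()}
# clean["answer"] -> "Call [PHONE] or email [EMAIL]."
```

The scrubbed pairs retain the information needed for fine-tuning or retrieval-augmented generation while dropping the identifiers that would create privacy exposure.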

When generating synthetic data to circumvent legal and privacy issues, there’s a triad of considerations for ensuring success. “When you look at synthetic data, you want to look at the accuracy, how close is it to real data,” Hann disclosed. “You want to look at privacy and then you want to look at performance. When you have large datasets, how does it scale?”
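Hann's triad can be turned into concrete checks. The sketch below (hypothetical helper names and toy data) scores a synthetic dataset on accuracy (how close its statistics are to the real data), privacy (no record copied verbatim), and performance (wall-clock time to generate at scale).

```python
import statistics
import time

def accuracy_score(original, synthetic, col):
    """Relative gap between the original and synthetic column means."""
    mu_o = statistics.mean(r[col] for r in original)
    mu_s = statistics.mean(r[col] for r in synthetic)
    return abs(mu_o - mu_s) / abs(mu_o)

def privacy_check(original, synthetic):
    """True if no synthetic record duplicates an original record."""
    seen = {tuple(sorted(r.items())) for r in original}
    return all(tuple(sorted(r.items())) not in seen for r in synthetic)

def generation_time(generate, n):
    """Wall-clock seconds for generate(n) -- the scaling question."""
    start = time.perf_counter()
    generate(n)
    return time.perf_counter() - start

# Toy illustration of the accuracy and privacy checks.
original = [{"age": 34}, {"age": 51}, {"age": 29}]
synthetic = [{"age": 36.2}, {"age": 48.9}, {"age": 30.1}]

gap = accuracy_score(original, synthetic, "age")   # small gap = accurate
private = privacy_check(original, synthetic)        # True = no copies
```

In practice, accuracy is measured across full distributions and correlations rather than a single mean, and privacy evaluation includes nearest-neighbor distance tests, but the three axes are the same.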
