Accelerating AI Development With Synthetic Data
MODEL DEVELOPMENT
Synthetic datasets are becoming increasingly integral to the AI model development process. Advanced machine learning models require enormous amounts of data to train and, to a lesser extent, to fine-tune for enterprise production use. Synthetic data is ideal for remedying situations in which “you don’t have sufficient data, enough of the right data, to be confident in the analytics you’re doing,” Wujek mentioned. The advantage is even greater for use cases in which a dataset is available but is disproportionately less representative in some areas than in others.
For example, there may be ample training data for certain population demographics for a loan- or mortgage-approval model, but not enough data from other population segments. “If the data itself is biased, and by biased it’s focused on one group or the positive impacts are seen by only one group, when you develop a model and start making decisions off that model, those decisions are just going to perpetuate that bias,” Wujek noted.
However, so long as there is some data for underrepresented groups (the synthetic data generation process requires input data), synthetic data can mitigate that bias by increasing the amount of training data for those groups. Additionally, synthetic data’s simulation proficiency makes it well suited to determining how new products or services might function in production settings. “If you’re a developer working on mobile apps for banking, you typically don’t get access to production data,” Hann revealed. “You have to develop those apps with dummy data, test data, or maybe use your own information. In this scenario, synthetic data helps give you a clearer, better picture of what your customers will see when they use the app.”
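To make the bias-mitigation idea concrete, below is a minimal sketch of one common tactic: fit a simple distribution to the underrepresented group’s existing records and sample synthetic rows from it. The column names, the Gaussian assumption, and the row counts are all illustrative; production generators model joint distributions far more carefully.

```python
# Minimal sketch: rebalance training data by synthesizing rows for an
# underrepresented group. Column names ("group", "income", "debt") are
# hypothetical stand-ins, not from any real schema.
import numpy as np
import pandas as pd

def augment_minority(df, group_col, minority, numeric_cols, n_new, rng):
    """Sample synthetic rows from a Gaussian fit to the minority group."""
    subset = df[df[group_col] == minority]
    mean = subset[numeric_cols].mean().to_numpy()
    cov = np.cov(subset[numeric_cols].to_numpy(), rowvar=False)
    samples = rng.multivariate_normal(mean, cov, size=n_new)
    synthetic = pd.DataFrame(samples, columns=numeric_cols)
    synthetic[group_col] = minority
    return pd.concat([df, synthetic], ignore_index=True)

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "group":  ["A"] * 900 + ["B"] * 100,   # group B is underrepresented
    "income": rng.normal(60_000, 15_000, 1_000),
    "debt":   rng.normal(20_000, 8_000, 1_000),
})
balanced = augment_minority(df, "group", "B", ["income", "debt"], 800, rng)
print(balanced["group"].value_counts())
```

Sampling from a fitted distribution rather than duplicating minority rows gives the model new, plausible points to learn from, though it only works when the existing minority records capture the group’s real variation.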
MODEL TESTING AND SIMULATION
The distinction between using synthetic data to test models and for simulation purposes may be nominal. Suppose you ask a language model-driven “answer engine” about how chip manufacturers rely on synthetic data. “If you ask this question and I change the words ‘synthetic data’ to ‘simulation data,’ I would get exactly the same list,” Aasman conjectured. “The word ‘synthetic’ is interchangeable with ‘simulation.’ It’s exactly the same thing.”
There are multiple ways in which synthetic data aids the model testing process. The most practical may be its capacity to generate data in sufficient quantities “so you have a greater test set to work on,” Wujek said. This advantage is particularly helpful after organizations have designed customer-facing applications involving advanced machine learning models, although it certainly applies to other use cases too. Because synthetic datasets can closely mirror an original dataset, they can even be used to generate artificial customers to assess how models perform.
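As a rough illustration of those artificial customers, the sketch below bootstraps rows from a stand-in “real” dataset and jitters the numeric values so no synthetic customer is an exact copy of a real one. The dataset, the columns, and the 5% noise scale are invented for illustration.

```python
# Hedged sketch: generate artificial customers to exercise a model in
# testing, without exposing real people's records.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def synthesize_customers(real, n):
    """Bootstrap rows from the real data, then jitter numeric columns."""
    sampled = real.sample(n=n, replace=True, random_state=0).reset_index(drop=True)
    for col in sampled.select_dtypes(include="number"):
        sampled[col] += rng.normal(0, real[col].std() * 0.05, size=n)
    return sampled

# Stand-in for a real customer table.
real_customers = pd.DataFrame({
    "age":     rng.integers(18, 80, 500),
    "balance": rng.normal(5_000, 2_000, 500),
})
test_set = synthesize_customers(real_customers, 2_000)
print(test_set.describe())  # the model under test sees only synthetic rows
```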
“On the back end, it sure would be nice if we could test some of those things out on cases that don’t impact real people,” Wujek pointed out. “So, [with] model validation, and the impact of model predictions and downstream decisions, you can play that out on synthetic data and synthetic representations of people, as opposed to real people.” In fact, synthetic data can be so realistic that it can simulate circumstances or scenarios that are not commonly found. In this respect, its simulation capabilities can help prepare data scientists for almost any situation before operationalizing models. “We can generate observations that are maybe more corner cases, rare events, that really allow us to stress-test the model that we’re using,” Wujek said. “What if this model were used on some certain case? What would the outcome be for things we just hadn’t thought about?”
This characteristic makes synthetic data widely used in simulation scenarios, from modeling climate change to modeling the behavior of computer chips.
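One hedged way to picture that stress-testing: deliberately sample inputs from the tails of each feature’s distribution and watch how the model responds. Everything below (the features, the thresholds, and the stand-in model) is hypothetical.

```python
# Minimal sketch: synthesize corner cases by sampling at least z standard
# deviations from each feature's mean, then score them with the model.
import numpy as np

rng = np.random.default_rng(42)

def tail_cases(mean, std, n, z=3.0):
    """Draw values at least z standard deviations from the mean."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return mean + signs * (z + rng.exponential(0.5, size=n)) * std

amounts = tail_cases(mean=80.0, std=40.0, n=1_000)   # transaction amount
ages = tail_cases(mean=5.0, std=3.0, n=1_000)        # account age, years

def model(amount, age):
    """Stand-in for the model under test: flags implausible combinations."""
    return (amount > 200.0) & (age < 0.0)

flags = model(amounts, ages)
print(f"{flags.mean():.1%} of synthetic corner cases flagged")
```

Cases such as a negative account age would rarely appear in production data, which is precisely why generating them synthetically is useful before a model is operationalized.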
SYNTHETIC DATA APPROACHES
The methods of producing synthetic data span the spectrum of AI techniques. Most synthetic data deployments rely on statistical models, often operating at the scale and compute demands of deep learning.
However, these machine learning approaches aren’t limited to deep neural networks. According to Aasman, “Any machine learning models that are meant to do predictions can create synthetic data. But, whether that’s the best way to do it, that’s another thing.” It’s even possible to generate synthetic data without machine learning, or without statistical AI altogether.
These are some of the ways to create this form of data:
- Non-Statistical Rules-Based Approaches: Non-statistical rules are the centerpiece of many symbolic reasoning systems. They can be applied to make synthetic data in situations in which “you have known patterns for a certain attribute,” Wujek explained. “If there are known statistical distributions, you can codify that so when you go to generate values for that attribute, it follows the pattern or distribution.” (A minimal sketch of this approach appears after this list.)
Rules-based modes of creating synthetic data rely on human knowledge and experience. “If you’re trying to define some values for a location, you need to make sure the state and the ZIP code make sense together,” Wujek remarked. “Or, if you’re even just creating names for an observation, an English last name with an Asian first name probably doesn’t make much sense.”
- Statistical Rules-Based Methods: Aasman mentioned that the state transition machine model underlying Synthea, MITRE’s synthetic data generator, is rules-driven yet inherently statistical. This fact is significant because AI is usually described as either statistical or non-statistical, with most rules-based deployments being non-statistical. “This is kind of in between,” Aasman commented with regard to the state transition machine model. “This is statistical. Say I have state A and then B and then C. Then, I can say if I’ve got A and B and C, then the most likely thing is E or F or G, based on the last three or four or five hundred states.” (The second sketch after this list illustrates the idea.)
- Statistical Approaches Without Rules: Most statistical AI approaches involve machine learning. The vast majority incorporate neural network architectures.
Generative adversarial networks (GANs) are broadly used to generate synthetic data. Language models, including large language models (LLMs), may also be involved. According to Hann, “In the space of text data, we can use any language model on Hugging Face. Those language models are then fine-tuned with the text data in the context of structured data. Then, those fine-tuned LLMs are actually used to create that synthetic data.” (A hedged sketch of this route appears after this list.)
- Pairing Statistical and Non-Statistical AI: There are also applications in which users invoke both statistical and non-statistical AI methodologies.
Wujek described such a situation: “There are multiple tables that are related in some way and need to be fused to feed the generative process that we’re trying to replicate. There’s going to be columns in there that need to be driven by rules that we would not want to go over to an algorithm in general to generate.” After fusing those tables, machine learning models can produce synthetic data from the resulting table. (The final sketch after this list gives a toy version of this pairing.)
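The first sketch below shows the non-statistical, rules-based approach: hand-written rules keep attributes such as state and ZIP code mutually consistent. The lookup tables are tiny illustrative stand-ins for real reference data.

```python
# Minimal sketch: rules-based synthetic records. A ZIP code is only ever
# generated from prefixes known to belong to its state.
import random

random.seed(1)

ZIP_PREFIXES = {"NC": ["27", "28"], "CA": ["90", "94"], "NY": ["10", "11"]}
FIRST_NAMES = ["Ava", "Liam", "Noah", "Mia"]
LAST_NAMES = ["Nguyen", "Smith", "Garcia", "Kim"]

def synthetic_record():
    state = random.choice(list(ZIP_PREFIXES))
    prefix = random.choice(ZIP_PREFIXES[state])  # rule: prefix matches state
    return {
        "name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}",
        "state": state,
        "zip": prefix + f"{random.randrange(1000):03d}",
    }

for _ in range(3):
    print(synthetic_record())
```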
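The second sketch illustrates, in drastically simplified form, the “statistical rules” idea behind state transition machines such as Synthea’s: rules define which states may follow which, while probabilities decide which transition actually fires. The states and weights here are invented.

```python
# Minimal sketch: a probabilistic state transition machine. Each state's
# allowed successors are rule-defined; successors are drawn by weight.
import random

random.seed(2)

TRANSITIONS = {
    "healthy":   [("healthy", 0.90), ("diagnosed", 0.10)],
    "diagnosed": [("treated", 0.70), ("diagnosed", 0.30)],
    "treated":   [("healthy", 0.60), ("treated", 0.40)],
}

def simulate(start, steps):
    path, state = [start], start
    for _ in range(steps):
        options, weights = zip(*TRANSITIONS[state])
        state = random.choices(options, weights=weights, k=1)[0]
        path.append(state)
    return path

print(simulate("healthy", 10))  # one synthetic patient trajectory
```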
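For the language-model route Hann describes, the hedged sketch below uses the Hugging Face transformers library, substituting an off-the-shelf gpt2 model for the fine-tuned LLM an actual deployment would use; the prompt and generation settings are illustrative.

```python
# Hedged sketch: generate synthetic free-text records with a Hugging Face
# language model. A production pipeline would fine-tune the model first.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Customer complaint about a delayed wire transfer:"
outputs = generator(prompt, max_new_tokens=40, num_return_sequences=3,
                    do_sample=True, temperature=0.9)
for out in outputs:
    print(out["generated_text"])
```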
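Finally, a toy version of the pairing Wujek describes: deterministic rules fill the columns that must obey known constraints, while a statistical draw (standing in for a full generative model over the fused tables) fills the rest. All names, distributions, and thresholds are illustrative.

```python
# Minimal sketch: pair rule-driven columns with statistically generated ones.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 1_000

# Statistical part: numeric columns drawn from fitted-style distributions.
frame = pd.DataFrame({
    "income": rng.lognormal(mean=11.0, sigma=0.4, size=n),
    "tenure_years": rng.integers(0, 30, size=n),
})

# Rule-driven part: columns an algorithm should not be trusted to invent.
frame["account_id"] = [f"ACCT-{i:06d}" for i in range(n)]  # format rule
frame["eligible"] = (frame["income"] > 30_000) & (frame["tenure_years"] >= 2)

print(frame.head())
```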