Taming the Data Quality Issue in AI
Data quality is the showstopper of AI. Many enterprise leaders who were enthusiastic about the business potential of AI are realizing that their efforts will be dead in the water if the data they are employing to train and populate their AI models is incomplete, inaccurate, or out of date.
“While not always a deal-breaker, poor data can severely limit AI effectiveness,” agreed Robert Ghrist, associate dean for undergraduate education at the University of Pennsylvania School of Engineering and Applied Sciences.
While investments in generative AI (GenAI) “are being urgently prioritized, unfortunately organizations are not paying enough consideration to an underlying data strategy that unlocks the value and largest returns,” said Drew Firment, cloud strategist at Pluralsight. “As a result, very few enterprises can clearly demonstrate the business value of their AI projects beyond chatbots.”
Data privacy and security are overwhelmingly the top AI concerns among the 1,600 decision makers across key global markets polled in a survey published by SAS. More than three-quarters, 76%, cited concerns with data privacy, and 75% cited concerns with data security.
Governance was seen as an issue by 56%, and 52% were concerned with ethical implications. The SAS research finds that only one in 10 organizations has undergone the preparation needed to comply with GenAI regulations. In addition, 95% of businesses lack a comprehensive governance framework for GenAI.
AI is only the latest twist in the ongoing quest for data quality, which has been “the biggest reason for project failure for decades,” Mickey Schauf, senior manager for data quality at SAS, pointed out. “When it comes to AI projects, if we don’t feed high-quality data into the model, the answer will be wrong. If that leads to a low grade on a high school term paper, that’s low risk. If we’re looking at AI models to define the parameters on the next miracle cancer drug, the risk is incredibly high.”
IN TRAINING
How much of a problem, exactly, does the quality of data pose to AI projects these days? Today’s most popular model-centric AI strategies “prioritize building AI algorithms for performance, focusing on optimizing the model’s architecture and parameters,” said Chin-Heng Hong, VP of product management at Couchbase. “However, this strategy can overlook the data quality and relevance used for training AI models, resulting in models lacking resilience when presented with new data.”
AI and machine learning models “are representations of the data on which they are trained; if the data quality is insufficient, or outcome data incomplete, the resulting model is similarly corrupt,” agreed Scott Zoldi, chief analytics officer at FICO.
AI models require massive amounts of data, “which is mostly unstructured and ungoverned,” Firment said. “Many enterprise architectures lack a modern data strategy and are not ready for the complexity and intense computational demands of AI workloads. As a result, the quality and integrity of the underlying data generates outcomes that are often unreliable, unpredictable, and outdated.”
Until recently, “there has been a tendency to brush the issue of data quality under the rug,” said Steven Karan, VP and head of insights and data for Capgemini Canada. “However, now the adoption—or rather the attempted adoption—of AI projects has placed a bright spotlight on the problem. Recent forays into AI have only amplified the awareness of the impact of poor data quality.”
While issues such as failing to deliver business outcomes or an inability to scale beyond the proof-of-concept stage are top of mind, the real culprit that needs to be addressed is “poor data quality,” said Karan. “Poor data quality not only impacts the real data that is ingested by algorithms but also the quality of the synthetic data that is required to properly train AI models.”
Data used to train AI models “is just as important as the methods used to develop it,” said Sonia Mezzetta, director of product management for watsonx.data at IBM. “Ensuring data quality means training AI models on highly curated data with controls and filters in place to ensure data accuracy, completeness, and provenance, while also lowering the risk of model hallucination. As part of this, it’s crucial to amass a large, relevant dataset and [ensure] that the system is designed for transparency and explainability.”
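The "controls and filters" Mezzetta describes can be as simple as automated gates that run before any record reaches a training set. The sketch below is a minimal, hypothetical illustration of such gates in Python; the field names, bounds, and quarantine approach are assumptions for the example, not taken from IBM's watsonx.data or any other product mentioned here.

```python
# Hypothetical pre-training data quality gates: completeness, accuracy
# (range checks), and deduplication. Field names and thresholds are
# illustrative only.

REQUIRED_FIELDS = {"patient_id", "dosage_mg", "outcome"}

def is_complete(record: dict) -> bool:
    """Completeness: every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def is_accurate(record: dict) -> bool:
    """Accuracy: numeric values fall within plausible bounds
    (example bounds, chosen arbitrarily for the sketch)."""
    dosage = record.get("dosage_mg")
    return isinstance(dosage, (int, float)) and 0 < dosage <= 1000

def filter_training_data(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into those fit for training and those quarantined
    for human review, rather than silently dropping bad rows (which
    would hide the quality problem instead of surfacing it)."""
    accepted, quarantined = [], []
    seen_ids = set()
    for r in records:
        ok = (is_complete(r) and is_accurate(r)
              and r.get("patient_id") not in seen_ids)
        (accepted if ok else quarantined).append(r)
        if ok:
            seen_ids.add(r["patient_id"])
    return accepted, quarantined
```

Quarantining rejected records, rather than discarding them, preserves the provenance trail the quote calls for: reviewers can see exactly which rows were excluded and why.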