Taming the Data Quality Issue in AI
AWARENESS
When it comes to data quality in AI, there simply isn’t enough awareness of the issue as people charge ahead. On raising awareness about the data feeding AI systems, “there’s still a long way to go,” said Mezzetta. “In their rush to innovate and their eagerness to gain an edge amid the AI market hype, many organizations may be underestimating the attention their teams need to devote to the data collection and handling phase of the AI development lifecycle—or the demands they need to place on their AI vendors.”
Database managers “who work with the data every day are certainly aware of the issue, but this has not spread to the leadership level like it needs to,” said Robert Daigle, director of global AI business at Lenovo. “I’ve seen that business leadership typically holds assumptions that a lot of the challenges around data have been fixed, whether it’s siloed data access, governance quality, or management resiliency. Companies often assume they are in better shape than they actually are.”
Compounding the issue is that people “will often be blinded by the technology and assume it’s always right,” said Schauf. “If an AI hallucination is in a coding example through a browser, the risk is low. If the hallucination is in your vehicle’s GPS telling you to turn right, the wrong data could mean you land in the middle of a lake. Software companies need to adopt AI technology, but never stop testing it to ensure answers are valuable.”
Business and technology leaders “tend to focus on the end of the AI value chain,” but to get there, they need data transparency and full traceability of data outputs, Mezzetta said. “This is a complex process, but one that is critical to the success of an AI system. It is essential for businesses to ensure they are using the correct data, understand its origin, track its evolution, and identify any inconsistencies in data flows.”
ROOT CAUSES
The root causes of data quality issues are diverse. These include “inconsistent collection methods, siloed systems, and inadequate governance,” said Ghrist. “Outdated or biased data compounds these problems, and malicious data-poisoning makes it all worse. Improving data quality for AI requires a multifaceted approach.”
The reasons behind data quality issues “boil down to ‘garbage data in will produce garbage output,’” Mezzetta explained. “Reliable and trustworthy AI needs a high-quality and reliable data foundation. The root causes of data issues stem from not having the right data management framework and technologies to execute the right level of data unification, data validation, data security, and data governance.”
Such lapses may occur “due to budget constraints, competing priorities, or a lack of awareness regarding the depth of data management requirements,” said Matt McLarty, CTO for Boomi. “Additionally, some may mistakenly believe that technology alone can maintain data quality without the need for human expertise and context.”
Low-quality data also comes from data sparsity, Hong pointed out. “Incomplete or missing datasets force AI models to fill in the gaps, potentially leading to inaccurate outputs or hallucinations. While a human has the ability to recognize and address these gaps, AI models do not possess the critical-thinking abilities to navigate and address these issues.”
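A quick audit can make such gaps visible before a model is left to fill them. The sketch below is a minimal, hypothetical illustration in Python, assuming a pandas DataFrame and an arbitrary 20% missing-rate threshold (both assumptions for illustration, not anything Hong prescribes):

# A minimal sketch of a pre-training sparsity audit; the DataFrame
# contents and the 20% threshold are illustrative assumptions.
import pandas as pd

SPARSITY_THRESHOLD = 0.20  # assumption: columns more than 20% missing are suspect

def sparse_columns(df: pd.DataFrame) -> pd.Series:
    """Return the missing rate of each column that exceeds the threshold."""
    missing_rate = df.isna().mean()  # per-column fraction of missing values
    return missing_rate[missing_rate > SPARSITY_THRESHOLD]

Columns flagged this way can be repaired, re-collected, or excluded outright, rather than left for the model to guess at.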
In addition, the proliferation of data silos “impacts the quality of data by complicating and slowing access to information,” Hong continued, “preventing businesses from accessing the true value of their data and AI models from leveraging the most up-to-date and relevant data.”
Data bias and outdated data severely corrupt AI output, Zoldi added. “LLMs [large language models], given the expense of their development, are typically trained on fixed-age data that might not reflect current or correct facts, producing inaccurate results and increasingly dated inaccuracies.”
Data bias “is one of the most significant issues with models today and made more troublesome in LLMs, where data provenance and data filtering are often suspect,” said Zoldi. “State of the art in responsible AI development assumes all data is a liability, and that data is dirty and biased, very possibly introducing bias into learned latent relationships. This can be controlled in smaller traditional AI models by examining activations of latent features, but the same processes are computationally prohibitive for large language models.”
According to Katan, bad business processes are another source of low-quality data. “Where bad business processes exist, it is nearly certain that data errors caused by humans will go unchecked, a lack of standards or governance will let inaccurate data go unvalidated, and shadow teams will create dark data that is out of compliance.”
Data quality issues “have been around forever, but it’s not getting better with time,” said Schauf. “People scan a table for blank or null cells and, finding none, think it must be OK. But they should be asking if the data in the columns is fit for use.”
Schauf gave this illustration: “If I’m working on a table with a date-of-birth field in 2024, a quick scan looking for nulls or blanks will say the column is complete and let the table be used to train a large language model. But if the date of birth is 1/1/1865, that date isn’t fit for use and should be excluded because that individual would be 159 years old. The person doing data entry likely meant the year to be 1965, which means the subject is 59 years old and within the fit-for-use parameter.”
A simple data-quality rule “could identify that errant record and exclude it from the dataset being used to train the model,” Schauf continued. “This type of inclusion or exclusion often isn’t done on the data training models. Fundamental data quality in conjunction with a little data governance often cleans up a dataset to make it a high-quality training set for an LLM.”
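A rule of the kind Schauf describes takes only a few lines. The sketch below is a hypothetical implementation in Python; the “date_of_birth” column name, the pandas-based approach, and the 0-to-120-year plausibility bounds are all assumptions for illustration:

# A minimal sketch of a fit-for-use rule for dates of birth; the column
# name and the 0-to-120-year age bounds are illustrative assumptions.
from datetime import date
import pandas as pd

MAX_PLAUSIBLE_AGE = 120  # assumption: ages beyond this indicate data-entry errors

def fit_for_use(df: pd.DataFrame, today: date) -> pd.DataFrame:
    """Keep only rows whose date of birth implies a plausible age."""
    dob = pd.to_datetime(df["date_of_birth"], errors="coerce")
    parsed = dob.notna()  # completeness: drop rows whose date failed to parse
    age_years = (pd.Timestamp(today) - dob).dt.days / 365.25
    plausible = age_years.between(0, MAX_PLAUSIBLE_AGE)  # validity check
    return df[parsed & plausible]

# Schauf's example: 1/1/1865 (age 159) is excluded; 1/1/1965 (age 59) passes.
records = pd.DataFrame({"date_of_birth": ["1865-01-01", "1965-01-01"]})
print(fit_for_use(records, today=date(2024, 6, 1)))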
Data quality “has several dimensions which impact AI—including uniqueness, accuracy, completeness, timeliness, validity, and consistency,” said Mezzetta. “These dimensions have a long history and have been vetted through several data management evolutions. Without the proper focus, these dimensions will impact the integrity and trustworthiness of the data leveraged in AI. Data that isn’t timely, therefore outdated, will drive incorrect decision making, potentially leading to wasted time and financial investments. In the worst-case scenario, data that isn’t protected can lead to major regulatory fines from the improper use of data in AI.”
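Several of these dimensions can be scored mechanically. The sketch below is a hypothetical Python illustration; the “id” and “updated_at” column names and the one-year staleness window are assumptions, not a standard:

# A minimal sketch scoring three of the dimensions Mezzetta lists; the
# column names and staleness window are illustrative assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame, max_age_days: int = 365) -> dict:
    """Score uniqueness, completeness, and timeliness, each in [0, 1]."""
    updated = pd.to_datetime(df["updated_at"], errors="coerce")
    age_days = (pd.Timestamp.now() - updated).dt.days
    return {
        # Uniqueness: share of rows with a non-duplicated primary key.
        "uniqueness": float(1 - df["id"].duplicated().mean()),
        # Completeness: share of cells that are populated.
        "completeness": float(df.notna().to_numpy().mean()),
        # Timeliness: share of rows refreshed within the staleness window.
        "timeliness": float((age_days <= max_age_days).mean()),
    }

A low timeliness score flags exactly the outdated data Mezzetta warns will drive incorrect decision making, before it reaches a model.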
Poor data quality “is one symptom of a few related issues that will impact a business’ ability to deploy AI solutions safely at scale,” said Katan. “These include gaps in data cataloging, data lineage, master data management, and data privacy and protection. The root ailment of all these symptoms is ineffective enterprise data governance and gaps in data strategy.”
Many businesses are still rushing into AI initiatives “without fully understanding the complexities of their data environment,” said Diby Malakar, VP of product management for Alation. “The modern data landscape is increasingly complex, with organizations investing in various tools and software to derive value from their data. This often leads to chaotic, fragmented data spread across different systems and business units, making it harder to access and trust the necessary data.”