Taming the Data Quality Issue in AI
SOLUTIONS
The good news in data quality is that in recent years, “there are advancements in technologies that are better equipped at implementing efficient, quality-driven, and automated data management,” said Mezzetta.
What kinds of approaches, methodologies, or solutions are needed to raise data quality supporting AI systems? It all starts with good governance, industry experts stated. “Good governance policies help create high-quality data needed to ensure data is fit for use,” said Schauf. “We need to have data quality rules that check data formats, telephone numbers, dates, or Social Security numbers. Standardization rules will make sure the data is in the best shape possible for an LLM.”
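The kind of format checks and standardization rules Schauf describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`phone`, `signup_date`, `ssn`), the US-style patterns, and the helper functions are all hypothetical and do not come from any particular governance tool.

```python
import re
from datetime import datetime

# Hypothetical rules: both patterns assume US-style formats.
PHONE_RE = re.compile(r"^\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})$")
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def standardize_phone(raw):
    """Return the phone number as NNN-NNN-NNNN, or None if it does not match."""
    m = PHONE_RE.match(raw.strip())
    return f"{m.group(1)}-{m.group(2)}-{m.group(3)}" if m else None


def standardize_date(raw, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Return the date in ISO format, or None if no accepted format parses it."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def is_valid_ssn_format(raw):
    """Check the shape only (NNN-NN-NNNN); this does not verify a real SSN."""
    return bool(SSN_RE.match(raw.strip()))


def check_record(record):
    """Return (standardized_record, names_of_fields_that_failed_a_rule)."""
    cleaned = dict(record)
    failures = []

    phone = standardize_phone(record.get("phone", ""))
    if phone is None:
        failures.append("phone")
    else:
        cleaned["phone"] = phone

    date = standardize_date(record.get("signup_date", ""))
    if date is None:
        failures.append("signup_date")
    else:
        cleaned["signup_date"] = date

    if not is_valid_ssn_format(record.get("ssn", "")):
        failures.append("ssn")

    return cleaned, failures


cleaned, failures = check_record(
    {"phone": "(555) 123-4567", "signup_date": "03/15/2024", "ssn": "12-345-6789"}
)
```

In this sketch, a record that fails a rule is reported rather than silently dropped, so that a data custodian can decide whether to repair or exclude it.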
Good governance is a product of corporate culture. “Beyond any one set of methodologies, solutions, or approaches to raise data quality standards to support AI, what is truly required is a shift in mindset for a business,” said Karan. “Fixing data quality is not a one-time, 12-week program that will solve the issue once and for all. It is a continuous effort that requires defined processes and tools to sustain as a business priority.”
Data governance should center on “what data can or should be included or excluded, document age ranges, gender diversity, or any other specific tokens needed,” said Schauf.
Also needed is “a combination of approaches to raise the quality of data supporting AI systems,” said McLarty. “Involving business stakeholders from the start ensures that all data processes yield distinct outcomes, and that metrics for tracking progress are clearly defined.”
This includes “policies that specify data handling procedures, access controls, and quality standards,” McLarty continued. “Engaging individuals and identifying data custodians for each record can help control and mitigate the risk of errors.” Ideally, managers and architects should endeavor to “make it easy to include stakeholders without requiring them to learn new technical skills or adopt cumbersome processes.”
Communication between business and data leaders is crucial, of course, said Malakar. “Data leaders must emphasize the importance of simplifying and organizing the data infrastructure. The emergence of chief AI officers and chief data officers helps bridge the gap, ensuring AI projects are aligned with a strong data strategy.”
In addition, organizations should simply get “started on your first AI project, and work through it, knowing this will expose some of the data quality and data management issues a company may have,” said Daigle. “This allows for them to be solved along the way. This approach will make an organization’s first AI projects lengthier and involving of a process, but investing the time upfront will reward your organization later on.”
The next step is “to ensure standardization rules and filters adhere to all rules and policies,” he added. “This confirms that records in a table belong in training an LLM destined for an AI system.”
Once the data is fit for use and adheres to the project’s governance policies or business rules, “work can begin on extra filters to better shape the data,” Schauf said. “That way, when it’s fed into an LLM or the AI system itself, there won’t be issues, missing elements, or other problems.”
It’s also important to “evaluate the output compared to the input,” said Schauf. “Does the output meet parameters? If yes, the model can move into production. It’s also important to build repeatable processes so that the data could be regenerated without having to start from scratch.”
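One way to picture the filter-then-evaluate loop is as a small, repeatable function that can be rerun whenever the data needs to be regenerated. The sketch below is illustrative only: the filter, the required field, and the `max_drop_ratio` threshold are assumed parameters, not ones named by the experts quoted here.

```python
def apply_filters(records, filters):
    """Keep only the records that pass every filter (each filter is a predicate)."""
    return [r for r in records if all(f(r) for f in filters)]


def evaluate_output(input_records, output_records, required_fields, max_drop_ratio):
    """Compare the filtered output with its input against simple, assumed parameters."""
    dropped = len(input_records) - len(output_records)
    drop_ratio = dropped / len(input_records) if input_records else 0.0
    # Count required fields that are still empty after filtering.
    missing = sum(
        1
        for r in output_records
        for field in required_fields
        if r.get(field) in (None, "")
    )
    return {
        "drop_ratio": drop_ratio,
        "missing_values": missing,
        "passes": drop_ratio <= max_drop_ratio and missing == 0,
    }


# Hypothetical input: one record has no usable text.
records = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": ""},
    {"id": 3, "text": "world"},
    {"id": 4, "text": "ok"},
]
filters = [lambda r: bool(r.get("text"))]

output = apply_filters(records, filters)
report = evaluate_output(records, output, required_fields=["text"], max_drop_ratio=0.5)
```

Because the filters and thresholds are plain arguments, the same run can be repeated on refreshed data without starting from scratch.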
Data-centric AI strategies “enable organizations to provide experiences that are grounded in the freshest data, significantly improving the output of AI models and reducing hallucinations,” said Hong. “Data-centric strategies include data scrubbing and ensuring the data represents the real-world scenarios where the model will be deployed.”
Finally, effective deployment of retrieval-augmented generation (RAG) “enables AI models to pull data from external sources, giving LLMs the necessary context and information required to improve responses and avoid hallucinations,” said Hong. “However, RAG needs access to real-time data to ensure the output is as current and accurate as possible.”
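The idea can be shown with a deliberately simplified sketch. A production RAG system would normally use an embedding model and a vector store. Here, a toy keyword-overlap score stands in for retrieval, and the document structure, the 30-day freshness cutoff, and the function names are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def retrieve(query, documents, now, max_age_days=30, top_k=2):
    """Return up to top_k fresh documents ranked by word overlap with the query."""
    query_terms = set(query.lower().split())
    cutoff = now - timedelta(days=max_age_days)
    scored = []
    for doc in documents:
        if doc["updated"] < cutoff:
            continue  # skip stale data so the answer reflects current information
        overlap = len(query_terms & set(doc["text"].lower().split()))
        if overlap:
            scored.append((overlap, doc["updated"], doc))
    # Highest overlap first; ties go to the more recently updated document.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [doc for _, _, doc in scored[:top_k]]


def build_prompt(query, context_docs):
    """Place the retrieved text in front of the question for the LLM."""
    context = "\n".join(f"- {d['text']}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base with one outdated entry.
documents = [
    {"text": "Store hours are 9 to 5", "updated": datetime(2024, 5, 20)},
    {"text": "Store hours are 8 to 6", "updated": datetime(2023, 1, 1)},
    {"text": "Returns accepted within 30 days", "updated": datetime(2024, 5, 25)},
]

query = "what are the store hours"
hits = retrieve(query, documents, now=datetime(2024, 6, 1))
prompt = build_prompt(query, hits)
```

The freshness cutoff is what keeps the outdated store hours out of the prompt, which is the point Hong makes about needing current data.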
At the same time, “standard data methods and tools cannot be applied unchanged to AI,” said Mezzetta. “Enterprises need proper tooling to prepare, refine, cleanse, merge, and reuse data for AI. This includes modern and open data architectures that unify and process data efficiently in a secure, cost effective, and open manner to enable ease of sharing. AI-infused data governance can aid in data discovery and data validation, making it even easier for business users and others who lack technical expertise. Data sharing should be a scalable, reusable component to keep up with the data needs from AI.”