Adding Meaning to Data: Knowledge Graphs, Vector Databases, and Ontologies
PROVIDING GROUND TRUTH
One way to reduce incorrect answers is to give the LLM a source of ground truth—the knowledge and facts from a support knowledge base, for example—so that customers or customer support reps receive the correct information.
The LLM retrieves information from that source rather than from its own knowledge of the world represented in the model's vector space. There are many ways to do this. One is simply to query a knowledge source using standard full-text or faceted search. The challenge is the same one that search has always faced—missing or poor-quality content—meaning content that is not well tagged or structured enough to return specific answers to questions.
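As a minimal sketch of that first approach, the Python below filters a small in-memory knowledge source by facet values and ranks articles by simple keyword overlap. The article fields and facet names are invented for illustration and do not come from any particular system.

```python
# Minimal sketch: faceted filtering plus naive keyword scoring over a small
# in-memory knowledge source. Field names and records are illustrative only.
articles = [
    {"id": 1, "product": "TermLife", "type": "FAQ",
     "text": "How to update a beneficiary on a term life policy."},
    {"id": 2, "product": "AutoPlus", "type": "Troubleshooting",
     "text": "Steps to file an auto claim after a collision."},
]

def search(query, facets=None, top_k=3):
    """Apply facet filters, then rank the remaining articles by keyword overlap."""
    candidates = [a for a in articles
                  if not facets or all(a.get(k) == v for k, v in facets.items())]
    terms = set(query.lower().split())
    scored = [(len(terms & set(a["text"].lower().split())), a) for a in candidates]
    scored = [(score, a) for score, a in scored if score > 0]
    return [a for score, a in sorted(scored, key=lambda x: -x[0])[:top_k]]

print(search("update beneficiary", facets={"product": "TermLife"}))
```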
Another way to treat content is to ingest it into the vector space. Content needs to be broken up for ingestion, and if those components are well tagged and grouped into semantically meaningful chunks (an answer to a question, for example), the results will be more accurate. Instructing the LLM to answer only from the knowledge source, and to respond "I don't know" when the information needed to answer the question is missing, will significantly reduce, if not eliminate, hallucinations.
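A rough sketch of that pattern follows, assuming a pre-chunked knowledge source; the embed and call_llm functions are hypothetical stand-ins for whatever embedding model and LLM API are actually in use.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real embedding model call."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM completion call."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, chunks: list[str], top_k: int = 3) -> str:
    # Rank the pre-chunked content by similarity to the question.
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    # Ground the model: answer only from the retrieved chunks, or say "I don't know."
    prompt = (
        "Answer using only the source material below. "
        "If the answer is not in the source material, say \"I don't know.\"\n\n"
        f"Source material:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```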
METADATA SIGNALS AND ONTOLOGY SCAFFOLDING
Applying metadata is important because metadata provides additional signals that give nuance and context to the content. In fact, in one study, my company found that with metadata-enriched embeddings, an LLM was able to answer questions from a knowledge source with up to 83% accuracy, compared to just 53% without them. The study was based on a gold-standard set of 60 use cases for which the correct answer was known and against which the models were tested.
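One common way to build metadata-enriched embeddings is to prepend the metadata to the chunk text before embedding and keep the metadata alongside the vector for filtering at query time. The sketch below assumes that approach; the field names are invented, embed is the same hypothetical embedding stand-in as above, and this illustrates the general technique rather than the specific method used in the study.

```python
def embed(text: str):
    """Hypothetical stand-in for a real embedding model call."""
    raise NotImplementedError

def metadata_enriched_embedding(chunk: dict) -> dict:
    """Prepend metadata to the chunk text so the embedding carries those signals."""
    tags = " | ".join(f"{k}: {v}" for k, v in chunk["metadata"].items())
    enriched_text = f"{tags}\n{chunk['text']}"
    return {
        "vector": embed(enriched_text),   # embedding of metadata + content
        "metadata": chunk["metadata"],    # kept alongside the vector for filtering
        "text": chunk["text"],
    }

chunk = {
    "text": "To reset the valve controller, hold the reset button for 10 seconds.",
    "metadata": {"product": "ValveController X2",
                 "content_type": "Troubleshooting",
                 "audience": "Field technician"},
}
# record = metadata_enriched_embedding(chunk)
```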
An ontology serves as a reference source for the metadata and controlled vocabularies in the content architecture. The ontology describes a domain of information in terms of the "big picture" principles representing what is important to the organization. Different domains will have different "buckets," or categories, which make up the "domain model."
For example, the domain model for an insurance company will have products, services, content types, customer types, risks, operational regions, and so on. A pharmaceutical company will have biochemical pathways, generic drugs, branded drugs, chemical names, diseases, indications, treatments, symptoms, drug targets, and mechanisms of action. An industrial manufacturer will have product types, product attributes, industries, customer types, processes, environments, and other entities. When the various vocabularies (the terms that populate each entity in the domain model—the list of product categories, for example) are created, the result is an enterprise taxonomy. By describing relationships among taxonomies (indications for a disease or risks in a region), we can build an ontology that consists of all the vocabularies in the domain and the relationships among them.
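To make this concrete, the fragment below encodes a small slice of the pharmaceutical example as an RDF graph using the rdflib library. The namespace, classes, and terms are invented for illustration and are not a complete or authoritative ontology.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/pharma/")  # illustrative namespace
g = Graph()
g.bind("ex", EX)

# Taxonomy: controlled vocabulary terms grouped under domain-model categories.
g.add((EX.Metformin, RDF.type, EX.GenericDrug))
g.add((EX.Metformin, SKOS.prefLabel, Literal("Metformin")))
g.add((EX.Type2Diabetes, RDF.type, EX.Disease))
g.add((EX.Type2Diabetes, SKOS.prefLabel, Literal("Type 2 diabetes")))

# Ontology: relationships that connect the vocabularies, e.g. indications.
g.add((EX.Metformin, EX.indicatedFor, EX.Type2Diabetes))

print(g.serialize(format="turtle"))
```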
The ontology forms the knowledge scaffolding of the enterprise. Using that scaffolding to access and organize data and content produces a knowledge graph. The knowledge graph becomes a reference for content models, for tagging of information, and for the content and data itself. When integrated with an LLM, the knowledge graph serves as the ground truth: an access point for reference information and the "source of truth" for retrieval.
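Continuing the sketch above, one way to treat the graph as the source of truth is to pull only the facts about the entity mentioned in a question and hand those facts, rather than open-ended context, to the model. The facts_about and grounded_answer functions are illustrative names, and call_llm remains a hypothetical stand-in.

```python
from rdflib import Graph, Literal
from rdflib.namespace import SKOS

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM completion call."""
    raise NotImplementedError

def facts_about(graph: Graph, label: str) -> list[str]:
    """Return every triple whose subject carries the given SKOS prefLabel."""
    facts = []
    for subj in graph.subjects(SKOS.prefLabel, Literal(label)):
        for pred, obj in graph.predicate_objects(subj):
            facts.append(f"{subj} {pred} {obj}")
    return facts

def grounded_answer(graph: Graph, question: str, entity_label: str) -> str:
    # Constrain the model to the facts retrieved from the knowledge graph.
    facts = "\n".join(facts_about(graph, entity_label))
    prompt = (
        f"Using only these facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "If the facts are insufficient, answer \"I don't know.\""
    )
    return call_llm(prompt)

# Example, reusing the graph g built in the previous sketch:
# grounded_answer(g, "What is Metformin indicated for?", "Metformin")
```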
IMPROVING DATA QUALITY WITH RAG
Once the reference ontology is designed, LLMs can also be used to improve data quality and data fill. Many sources of information can be normalized and contextualized using ontologies.
These sources include knowledge articles, specification sheets, webpages, troubleshooting guides, industry standards, style guides, user profiles, and user behaviors—the telemetry from clickstreams, searches, campaign responses, call center history, and more. The mechanism for fixing the data uses a form of RAG in which the prompt is unenriched data and the result is enriched data. We use an approach called "modular RAG," which applies multiple state-of-the-art algorithms to process data and programmatically generate data enrichments.
The modular RAG approach uses ontologies as a reference point for controlled vocabularies and for relationships between products (accessories, related products, solution kits, and more), as well as for attributes (specifications and other elements) that describe product details and applications.
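As an illustration of the enrichment step, the sketch below sends an unenriched product record plus ontology-derived controlled vocabularies to the model and expects the tagged record back as JSON. The vocabularies and field names are invented, call_llm is the same hypothetical stand-in used earlier, and the code shows the general prompt-in, enriched-data-out pattern rather than the modular RAG pipeline itself.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM completion call."""
    raise NotImplementedError

# Controlled vocabularies drawn from the ontology (values invented for illustration).
PRODUCT_TYPES = ["Pressure relief valve", "Flow sensor", "Pump controller"]
INDUSTRIES = ["Oil and gas", "Water treatment", "Food and beverage"]

def enrich_record(record: dict) -> dict:
    """Prompt in: unenriched record. Result out: the record tagged against the vocabularies."""
    prompt = (
        "Tag the product record below using ONLY the allowed values, and return "
        "the record as JSON with added 'product_type' and 'industries' fields.\n"
        f"Allowed product_type values: {PRODUCT_TYPES}\n"
        f"Allowed industry values: {INDUSTRIES}\n\n"
        f"Record:\n{json.dumps(record)}"
    )
    return json.loads(call_llm(prompt))

raw = {"sku": "PV-220",
       "description": "Stainless relief valve rated to 220 psi for refinery "
                      "and pipeline service."}
# enriched = enrich_record(raw)  # adds product_type and industries tags
```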
AI RUNS ON DATA
Industry experts have long advised organizations that AI runs on data. Organizations are becoming more aware of the foundational requirements of AI applications from a data quality, architecture, and governance perspective. Disappointment resulting from AI project failures is leading to a re-evaluation of approaches and a reassessment of expectations about what AI can, and cannot, do. The starting point is an ontology that forms the "knowledge scaffolding" for the organization—the organizing principles and vocabularies used to tag content and structure data—and the ways that entities relate to one another.
Building that reference lays the groundwork for making sense of and contextualizing information. The goal is to make information more contextually relevant—whether that is customer-facing content to assist in product selection, related products on an ecommerce site, troubleshooting information for field support, or strategic plans for a senior executive. Getting the right information to the right person is, at its core, a matter of context. With LLMs and GenAI, we have a greater ability to ingest signals and use "digital body language" to provide that information in context. But it all starts with good data, or, at the very least, a good data architecture that can be leveraged to produce good data to drive AI-powered applications.