
Adding Meaning to Data: Knowledge Graphs, Vector Databases, and Ontologies

This article discusses the importance of context in enterprise data for generative AI (GenAI), and, in fact, for any AI initiative.
There are common misconceptions about how AI operates and how enterprise data needs to be curated (or not). The roles of reference data, data architecture, and ontologies are critical to enabling accurate and useful retrieval of information in the enterprise.

THERE IS NO FREE LUNCH: THE PERENNIAL PROBLEM OF INFORMATION CURATION

During the past few decades, organizations have been struggling with capturing, managing, and accessing unstructured data—the knowledge artifacts that embody how the organization serves customers and solves problems. This information can be in the form of highly technical solutions, specifications, and manufacturing procedures, but it also includes go-to-market strategies and tactics, human resources policies, and everything else in written form.

While structured and transactional data has been managed and curated more effectively, many challenges remain around data silos, inconsistent data models, and varying semantics and definitions, as well as around how analytics is managed and operationalized.
Each generation of new technology (knowledge portals, semantic search, data warehouses, data lakes, graph data, knowledge graphs, and now large language models, or LLMs) promises to solve the problem. But none of these approaches can fix what is fundamentally flawed—the data hygiene and content curation processes that have been perennially underfunded and under-resourced.

THE FORCES OF ENTROPY

Building on this messy, not-well-managed world of data and information is difficult, no matter what the application. Of course, every organization has challenges with its data quality at some level, although some are in better shape than others. Usually, enough data architecture and content/data curation and cleansing is done to launch the system, but without ongoing curation and management processes to measure and remediate data issues that naturally crop up across time, the new system will gradually succumb to the forces of entropy.

Across time, it will become less and less useful. A new content or knowledge system starts out great—a nice, new, clean environment— but without the correct processes and measures, it gets messier over time.
Unfortunately, without sufficient funding to solve challenges at their source, the IT organization is left with little choice but to build upon inconsistent, poor-quality, and missing data. Many times, a new project catalyzes a cleanup effort but doesn’t truly address the information governance gaps that defeat the cleanup across time.

As a result, customer support organizations carry on with poorly curated knowledge, leading to higher costs and lower customer satisfaction, and sales organizations struggle with locating the most effective collateral. Due to variations in definitions and semantics, analytics teams produce the same analyses repeatedly rather than cataloging what they already have.

ENTER AI

Add to this the rapid advances in AI, specifically GenAI and LLMs, and the pressure to develop competitive advantage. Not wanting to be left behind increases the urgency to act.
In some cases, organizations are attempting to train models on their own content, but training a model from scratch can be costly and difficult, and it can lead to unexpected outcomes.

The underlying problem lies in the quality of training content and data. What exactly is good training data? Well-curated, structured content and data assets carrying the correct descriptors (metadata)—the very things whose absence people are compensating for when deploying LLMs.

There is a misconception that LLMs alone will solve poor-quality and missing data. That is only partly true. An LLM can be used to improve product data, for example, but it requires the correct context to do so—product names, categories, and attributes, along with related content and knowledge artifacts.

WHAT IS RAG?

Retrieval-augmented generation (RAG) improves LLM performance by providing a source of truth for the model.
When you ask an LLM a question, the mechanism it uses is based on a representation of the world created by ingesting enormous amounts of content from the internet. Deep neural networks process this content using a variety of mechanisms for statistically predicting which words are most likely to relate to the query. That is an overly simplistic explanation, but at their core, ChatGPT and other LLM-based tools are prediction mechanisms. Their ability to provide human-sounding answers comes from mathematical operations that treat text as a series of numeric values and then iteratively operate on those values to produce an answer.
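The RAG pattern described above can be sketched in a few lines: retrieve the passages most relevant to the question, then prepend them to the prompt so the model answers from a source of truth. The document store and word-overlap scoring below are illustrative stand-ins for a real vector search, not any specific product's API.

```python
def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query.

    A stand-in for real vector similarity search, used here only to
    illustrate the retrieval step of RAG.
    """
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble an augmented prompt: retrieved context plus the question."""
    context = retrieve(query, documents)
    return ("Answer using only the context below.\n\n"
            "Context:\n" + "\n".join(f"- {c}" for c in context) +
            f"\n\nQuestion: {query}")

# Hypothetical enterprise knowledge snippets.
docs = [
    "The X100 router supports firmware version 4.2 and later.",
    "Annual leave requests are submitted through the HR portal.",
    "Firmware 4.2 adds support for WPA3 on the X100 router.",
]
prompt = build_prompt("Which firmware does the X100 router support?", docs)
```

Because the model is instructed to answer only from the retrieved context, its response is grounded in the enterprise's own content rather than its statistical picture of the internet.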

However, if the answer is not contained in the LLM’s understanding of the world, it will still provide a response—one that sounds reasonable and is statistically likely but is potentially factually incorrect. These are the so-called “hallucinations”—answers that do not have a basis in fact.
The LLM has a representation of the world that does not necessarily contain corporate information (unless that information has been made publicly available). When a question is asked (a prompt), it is converted into a point in a vector space. A vector is a mathematical representation of text; converting text into a series of numbers in the vector space is referred to as “embedding.” Vectors capture the nuances of language by modeling “features” of the content. Features are essentially metadata—the “about-ness” of a piece of information.
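A toy example makes "embedding" concrete. Real embeddings come from trained neural models and capture semantic meaning; the word-hashing trick below only illustrates the shape of the operation—text in, fixed-length numeric vector out.

```python
import hashlib

def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Toy embedding: hash each word into one of `dims` buckets and count.

    This is NOT how real embedding models work; it only shows text being
    converted into a fixed-length vector of numbers.
    """
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    # Normalize to unit length so vectors are directly comparable.
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

v = toy_embed("knowledge graphs add meaning to data")
```

In a production system the vector would come from an embedding model and have hundreds or thousands of dimensions, but the idea is the same: similar text lands near similar text in the vector space.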

Features represent multiple dimensions of information. Vector representations can have hundreds, thousands, or even tens of thousands of dimensions. It is difficult to think in more than three dimensions (four, if we add time), but mathematically, it is feasible. The greater the number of dimensions, the more complex the text the LLM can process, and the greater the nuance captured in the data.

In vector similarity search, the prompt vector is compared to other terms and phrases in the vector space. The vectors that are closest in “n-dimensional space” form the basis for the output. Additional signals (essentially, metadata) such as customer segment, industry, interests, behaviors, and preferences bring the model closer to the right conceptual “location” in multidimensional vector space.
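"Closest in n-dimensional space" is usually measured with cosine similarity: how nearly two vectors point in the same direction. A minimal sketch, using a tiny three-dimensional space with made-up vectors (real systems use hundreds or thousands of dimensions, but the comparison works the same way):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query_vec: list[float], items: list[dict]) -> dict:
    """Return the stored item whose vector is closest to the query vector."""
    return max(items, key=lambda item: cosine(query_vec, item["vector"]))

# Illustrative store: each entry pairs text with a (made-up) vector.
store = [
    {"text": "refund policy",  "vector": [0.9, 0.1, 0.0]},
    {"text": "shipping times", "vector": [0.1, 0.9, 0.2]},
    {"text": "warranty terms", "vector": [0.8, 0.2, 0.1]},
]
best = nearest([0.95, 0.05, 0.0], store)
```

A vector database does the same comparison, but with indexing structures that make it fast across millions of vectors.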

Think of the GPS in your car. It provides directions to a specific geographic location. If you want it to take you to a location with certain characteristics (say, a moderately priced Italian restaurant with good reviews), those additional preference signals give the GPS more context. It is, essentially, navigating in “n-dimensional” space: each characteristic (price, cuisine, rating, and so on) adds a dimension to the query. The same thing happens in the vector database. The more detail we provide (through more specific prompts or through customer behaviors, preferences, configurations, prior requests, and the like), the closer we get to the correct output.
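The restaurant example can be sketched as a metadata-filtered vector search: apply the hard constraints (cuisine, price) as metadata filters first, then rank the surviving candidates by vector similarity. The field names and vectors below are illustrative, not a particular vector database's API.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Similarity score: dot product (equals cosine for unit-length vectors)."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical items: each has a (made-up) vector plus metadata "signals".
restaurants = [
    {"name": "Trattoria Roma", "vector": [0.9, 0.1],
     "meta": {"cuisine": "italian", "price": "moderate"}},
    {"name": "Luigi's",        "vector": [0.7, 0.3],
     "meta": {"cuisine": "italian", "price": "expensive"}},
    {"name": "Taco Loco",      "vector": [0.2, 0.8],
     "meta": {"cuisine": "mexican", "price": "moderate"}},
]

def search(query_vec: list[float], items: list[dict], **filters) -> list[dict]:
    """Filter on metadata first, then rank the survivors by similarity."""
    hits = [i for i in items
            if all(i["meta"].get(k) == v for k, v in filters.items())]
    return sorted(hits, key=lambda i: dot(query_vec, i["vector"]),
                  reverse=True)

results = search([1.0, 0.0], restaurants, cuisine="italian", price="moderate")
```

Most vector databases support exactly this combination—a metadata pre-filter plus similarity ranking—which is how the extra "signals" narrow the search to the right neighborhood of the space.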
