Modern Data Management for the AI Era
It’s well established that AI, both operational and generative AI (GenAI), is changing the game for data management. Today’s data environments need to support large (and not-so-large) language models that have to be constantly trained with fresh inflows of data.
At the same time, AI serves as a vehicle that supports more effective and more far-reaching best practices for data management. Operational AI is proving to be a way to better automate the essential processes that support modern data environments, from security to cost management, indexing, and provisioning.
In addition, GenAI will make it light years easier to open up queries against enterprise data, either through natural language processing or simpler construction of SQL-based commands. Plus, GenAI promises to make it easier to pull essential data out of silos and scattered systems.
Supporting AI
Building, supporting, and sustaining language models in enterprises requires the ability to move data quickly and efficiently from established data environments to the language models being employed to deliver GenAI and the statistical models supporting machine learning.
Machine learning (which typically is employed more behind the scenes within production equipment, vehicles, robots, and analytics tools) is more common than GenAI at this point, a recent survey by Databricks finds. And, overall, the number of machine-learning models is up across the companies in the study. Among this group, 1,018% more models were put into production this year, far outpacing experiments logged, which grew 134%. The average organization registered 261% more models and logged 50% more experiments this year.
In addition, the study shows 70% of companies that are leveraging GenAI are employing vector databases to augment their base models. “Companies are hyperfocused on customizing LLMs with their private data using retrieval augmented generation (RAG),” the study also showed. RAG typically requires vector databases, which grew 377% over the past year.
RAG, which retrieves relevant information from databases at query time, is seen as a way to extract more value and accuracy from LLMs. Without such grounding, GenAI has encountered issues such as outdated information and hallucinations in the output it delivers.
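The retrieval step behind RAG can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular vector database works: the word-count `embed` function stands in for a real embedding model, and `DOCUMENTS`, `retrieve`, and `build_prompt` are hypothetical names invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical private documents an enterprise might index.
DOCUMENTS = [
    "Refund requests are processed within 14 days of receipt.",
    "The data warehouse is refreshed nightly at 02:00 UTC.",
    "Vector indexes are rebuilt after each bulk load.",
]

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, docs, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, docs, k=1):
    """Prepend retrieved context so the model can answer from current data."""
    context = "\n".join(retrieve(question, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

if __name__ == "__main__":
    print(build_prompt("When is the data warehouse refreshed?", DOCUMENTS))
```

The prompt that reaches the language model carries the retrieved passage, which is how RAG can help with stale or invented answers. In production, the similarity search would typically be delegated to a vector database rather than computed in a loop.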
A key data management strategy essential to the delivery of AI-based solutions is data governance. Data needs to be timely, of the highest possible quality, and well-tuned to the business requirement at hand. A metadata layer needs to be put in place to ensure that data is available and accessible for consumption by AI systems.
Ultimately, people are the key to AI success. AI training at all levels is essential, as employees will be either building, maintaining, and training AI models or doing their jobs with the help of AI-enhanced systems.
AI for Data Management
Ironically, AI plays a role in delivering and supporting AI. The technology, in all its forms, promises to bring higher levels of automation, querying capabilities, security, and analytics to data environments. Operational AI serves as a means to automate the data pipeline that is key to these processes.
“Managing data … is a labor-intensive activity: It involves cleaning, extracting, integrating, cataloging, labeling, and organizing data, and defining and performing the many data-related tasks that often lead to frustration among both data scientists and employees without ‘data’ in their titles,” write Thomas H. Davenport and Thomas C. Redman in MIT Sloan Management Review.
Areas in which AI can play a role in data management include the following, as identified by Davenport and Redman:
- “Classification: Broadly encompasses obtaining, extracting, and structuring data from documents, photos, handwriting, and other media.
- “Cataloging: Helping to locate data.
- “Quality: Reducing errors in the data.
- “Security: Keeping data safe from bad actors and making sure it’s used in accordance with relevant laws, policies, and customs.
- “Data integration: Helping to build master lists of data, including by merging lists.”
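As one illustration of the data integration item, the sketch below merges two customer lists into a single master list. The record fields, the matching rule (a trimmed, lowercased email address), and the function names are all assumptions made for this example; real-world entity matching usually needs fuzzier logic, which is where AI-based tools come in.

```python
def normalize_key(record):
    """Match key: trimmed, lowercased email (an assumed matching rule)."""
    return record["email"].strip().lower()

def merge_master_list(*sources):
    """Merge record lists; later sources fill in fields missing from earlier ones."""
    master = {}
    for source in sources:
        for record in source:
            key = normalize_key(record)
            merged = master.setdefault(key, {"email": key})
            for field, value in record.items():
                if field == "email":
                    continue
                # Keep the first non-empty value seen for each field.
                if value and not merged.get(field):
                    merged[field] = value
    return list(master.values())

# Hypothetical source lists with an overlapping customer.
crm = [{"email": "Ana@Example.com ", "name": "Ana Diaz", "phone": ""}]
billing = [
    {"email": "ana@example.com", "name": "A. Diaz", "phone": "555-0100"},
    {"email": "bo@example.com", "name": "Bo Lee", "phone": "555-0101"},
]

if __name__ == "__main__":
    for row in merge_master_list(crm, billing):
        print(row)
```

Here the two entries for the same customer collapse into one record that keeps the name from the first list and picks up the phone number from the second.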
AIOps—the employment of AI to manage IT operations—plays a role in managing, analyzing, and automatically course-correcting data management systems. It employs AI to manage and accelerate IT and data systems capabilities to deliver services and functions.
Key roles of AIOps include performance monitoring, root-cause analysis, anomaly detection, and the reduction of alert fatigue. In addition, it speeds up administrators’ ability to resolve glitches. Perhaps most importantly, AIOps is a collaboration vehicle, bringing together data management teams, IT teams, and data scientists to mutually identify problems and solutions that deliver the best customer and user experiences.
Another collaborative process, machine learning operations (MLOps), is intended to boost interaction between data/IT teams and data scientists to develop and maintain machine-learning systems.
GenAI is opening new avenues of database capabilities, notably the ability to create SQL commands through natural language prompts. Creating SQL statements is a notoriously difficult and cumbersome task, requiring a certain level of training and skill. Now, vendors are introducing tools that enable users to develop and use SQL commands through simpler prompts.
The prompts employed to enable AI-driven SQL, however, need to be precise. This increases the need for prompt engineering to help both data analysts and end users get the data they need.
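What a precise prompt might look like can be sketched as follows. The example only assembles the prompt text and does not call any model; the table schema, the choice of PostgreSQL as the dialect, and the `build_sql_prompt` helper are hypothetical and shown purely for illustration.

```python
# Hypothetical schema: table name -> column names.
SCHEMA = {
    "orders": ["order_id", "customer_id", "order_date", "total_usd"],
    "customers": ["customer_id", "region"],
}

def build_sql_prompt(question, schema, dialect="PostgreSQL"):
    """Assemble a prompt that pins down dialect, schema, and output rules."""
    tables = "\n".join(
        f"- {table}({', '.join(columns)})" for table, columns in schema.items()
    )
    return (
        f"You write {dialect} queries.\n"
        f"Use only these tables and columns:\n{tables}\n"
        "Return a single read-only SELECT statement and nothing else.\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    print(build_sql_prompt("Total order value by region for 2024", SCHEMA))
```

Spelling out the schema and the expected output format is one common way to reduce ambiguity, though any generated SQL should still be reviewed before it runs against production data.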
Other promising AI-related solutions aim to make accessing and sifting through data more manageable. MIT researchers, for example, recently developed a tool targeted directly at database queries. The tool, called GenSQL, is intended to make it easier for database users to “perform complicated statistical analyses of tabular data without the need to know what is going on behind the scenes,” the researchers state. The tool “could help users make predictions, detect anomalies, guess missing values, fix errors, or generate synthetic data with just a few keystrokes.”
The researchers offer the following use case: “[I]f the system were used to analyze medical data from a patient who has always had high blood pressure, it could catch a blood pressure reading that is low for that particular patient but would otherwise be in the normal range. GenSQL automatically integrates a tabular dataset and a generative probabilistic AI model, which can account for uncertainty and adjust their decision-making based on new data.”
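The intuition in this use case, judging a reading against a patient's own history rather than a population-wide range, can be illustrated in plain Python. To be clear, this is not GenSQL syntax and it does not use a generative probabilistic model; it is a simple z-score check, and the readings, the 90 to 120 mmHg "normal" range, and the threshold of three standard deviations are all invented for the example.

```python
from statistics import mean, stdev

# Assumed population-wide "normal" systolic range in mmHg (illustrative only).
NORMAL_RANGE = (90, 120)

def in_normal_range(reading, normal_range=NORMAL_RANGE):
    """True if the reading falls inside the population-wide range."""
    low, high = normal_range
    return low <= reading <= high

def unusual_for_patient(history, reading, threshold=3.0):
    """Flag a reading far from this patient's own mean, in standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return reading != mu
    return abs(reading - mu) / sigma > threshold

# Hypothetical patient whose readings have always been high.
history = [150, 155, 148, 152, 158, 151]
new_reading = 115

if __name__ == "__main__":
    print("Within population range:", in_normal_range(new_reading))
    print("Unusual for this patient:", unusual_for_patient(history, new_reading))
```

A fixed range check would pass the new reading, while the per-patient check flags it, which is the kind of context-dependent anomaly the researchers describe.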
Data management is emerging as the most important discipline for creating and supporting AI; conversely, AI is being employed to improve the performance of data management itself. Either way, AI is changing the mission for the data managers of today and tomorrow.