
Alluxio Enterprise AI 3.5 Heralds Performance Efficiencies for AI Workloads

Alluxio, the AI and data acceleration platform, is debuting a series of enhancements to Alluxio Enterprise AI, the company’s high-performance distributed cache optimized to accelerate AI. Aiming to tackle the most common challenges in driving AI success, Alluxio’s latest updates streamline the complexities of AI infrastructure operations.

At its core, Alluxio accelerates AI by solving speed, scale, and resource-scarcity challenges with high-performance distributed caching and unified access to diverse data sources. This is especially critical given the state of enterprise data today, which continues to grow in volume and complexity. Where AI initiatives are concerned, organizations face a host of data inefficiencies that hinder model development and performance.

“The latest release of Alluxio Enterprise AI is packed with new capabilities designed to further accelerate AI workload performance,” said Haoyuan (HY) Li, founder and CEO of Alluxio. “Our customers are training AI models with enormous datasets that often span billions of files. Alluxio Enterprise AI 3.5 was built to ensure workloads perform at peak performance while also simplifying management and operations of AI infrastructure.”

Enterprise AI 3.5 introduces a new caching mode—CACHE_ONLY Write Mode—designed to significantly improve the performance of write operations. By writing checkpoint files during AI model training, enterprises can restart a training job from the closest point before a failure—ensuring that training is not forced back to the beginning of the process.

“These AI [training] workloads take a really long time, even when the infrastructure is running at peak, and all the bottlenecks have been removed—they're just long. They can take sort of days, weeks, and in some cases, multiple months,” explained Bill Hodak, VP of marketing and product marketing at Alluxio. “[The new caching mode] periodically writes these checkpoint files to save where they are in the workload, so that, if needed, they can come back to that point.”

Additionally, the new mode writes data exclusively to the Alluxio cache instead of the underlying file system (UFS), improving write performance while eliminating various bottlenecks associated with large files.

Hodak further contextualized the importance of this new caching mode, explaining that “Alluxio has been helping our customers primarily with reading data…which is obviously the biggest part of the bottleneck. But as our customers start to solve that problem, this idea of the checkpoint writing starts to appear. So, it's sort of like, whack a mole, right? We solved the read slowness problem, and then…what's the next bottleneck? And that turns out to be the writing.”

Ultimately, AI “workloads, by nature, take a pretty long time compared to non-AI workloads…being able to resolve performance bottlenecks throughout that entire process is critical to having end-to-end speed,” Hodak continued.
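The checkpoint-and-resume pattern Hodak describes can be sketched in plain Python. Everything here is illustrative—the file path, function names, and JSON format are assumptions for the sketch, not part of Alluxio's API: a training loop periodically persists its progress, and a restart resumes from the latest checkpoint instead of step zero.

```python
import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # illustrative; in practice this would target the cache


def save_checkpoint(step, state, path=CHECKPOINT_PATH):
    """Persist training progress so a restart can resume near the failure point."""
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return (step, state) from the latest checkpoint, or a fresh start."""
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)
        return data["step"], data["state"]
    return 0, {"loss": None}


def train(total_steps=100, checkpoint_every=10, path=CHECKPOINT_PATH):
    start, state = load_checkpoint(path)  # resume instead of restarting from step 0
    for step in range(start, total_steps):
        state["loss"] = 1.0 / (step + 1)  # stand-in for real training work
        if (step + 1) % checkpoint_every == 0:
            save_checkpoint(step + 1, state, path)
    return state
```

The performance point in the article is about where these writes land: routing them to the distributed cache rather than the underlying file system keeps frequent checkpointing from becoming its own bottleneck.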

Enterprise AI 3.5 also introduces several cache management capabilities, including TTL Cache Eviction Policies and Priority-based Cache Eviction Policies. TTL Cache Eviction Policies enforce time-to-live (TTL) settings on cached data, optimizing cache efficiency by automatically evicting less frequently accessed data. With Priority-based Cache Eviction Policies, enterprises can guarantee that specific data stays in the cache even if that data would otherwise be evicted under the Least Recently Used (LRU) cache eviction algorithm.

This feature gives enterprises the “opportunity to say, ‘This is the important data, please don't evict it; or this is really not important data, feel free to evict it, even if it's been used recently,’” explained Hodak. It affords businesses “a lot more control [at the] granular level to say what should stay in cache and what shouldn't.”
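To make the two eviction policies concrete, here is a minimal sketch of an LRU cache with per-entry TTLs and pinned (never-evict) entries. This illustrates the ideas described above, not Alluxio's implementation; all class and method names are invented for the example.

```python
import time
from collections import OrderedDict


class PriorityTTLCache:
    """Illustrative LRU cache with per-entry TTL and pinned entries (a sketch,
    not Alluxio's implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> (value, expires_at, pinned)

    def put(self, key, value, ttl=None, pinned=False):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self.entries[key] = (value, expires_at, pinned)
        self.entries.move_to_end(key)
        self._evict()

    def get(self, key):
        self._expire()
        if key not in self.entries:
            return None
        value, _, _ = self.entries[key]
        self.entries.move_to_end(key)  # mark as recently used
        return value

    def _expire(self):
        # TTL eviction: expired entries leave regardless of recency.
        now = time.monotonic()
        for key in [k for k, (_, exp, _) in self.entries.items()
                    if exp is not None and exp <= now]:
            del self.entries[key]

    def _evict(self):
        self._expire()
        # LRU eviction that skips pinned ("important, don't evict") entries.
        for key in list(self.entries):
            if len(self.entries) <= self.capacity:
                break
            if not self.entries[key][2]:
                del self.entries[key]
```

A pinned entry survives capacity pressure even when it is the least recently used item, which is exactly the granular "please don't evict it" control Hodak describes.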

Another update to Enterprise AI 3.5 focuses on how data scientists, machine learning engineers, and applications interact with data stored in Alluxio. Now, Alluxio’s Python SDK supports leading AI frameworks, including PyTorch, PyArrow, and Ray, simplifying the adoption of Alluxio Enterprise AI for Python applications. With a unified Python filesystem interface, applications can engage seamlessly with various storage backends, both local and remote.

“When machine learning engineers [and] data scientists are writing their training code in Python, and their data is residing in Alluxio, they just interact with it as if it were any other file system. There's nothing special that they need to do…making it much, much more seamless for them,” said Hodak.
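The "interact with it as if it were any other file system" idea can be illustrated with a toy sketch. None of these classes are Alluxio's SDK—they are stand-ins showing why a unified filesystem interface matters: the training code depends only on the interface, so swapping a local backend for a cache-fronted remote one requires no code changes.

```python
class FileSystem:
    """Minimal common interface; backends differ only in where bytes live."""

    def read(self, path):
        raise NotImplementedError

    def write(self, path, data):
        raise NotImplementedError


class LocalFS(FileSystem):
    """Toy stand-in for a local filesystem backend."""

    def __init__(self):
        self._store = {}

    def read(self, path):
        return self._store[path]

    def write(self, path, data):
        self._store[path] = data


class CachedRemoteFS(FileSystem):
    """Toy stand-in for a remote store fronted by a cache."""

    def __init__(self, remote):
        self.remote = remote  # e.g., object-store contents
        self.cache = {}

    def read(self, path):
        if path not in self.cache:      # first read pulls from the "remote"
            self.cache[path] = self.remote[path]
        return self.cache[path]          # subsequent reads hit the cache

    def write(self, path, data):
        self.cache[path] = data


def load_training_sample(fs: FileSystem, path: str):
    """Training code is written once, against the interface, never the backend."""
    return fs.read(path)
```

In practice, Alluxio exposes this through a Python filesystem interface that frameworks such as PyTorch, PyArrow, and Ray can consume, per the release.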

Outside of these key features, Alluxio is unveiling several updates to its S3 API, as well as various other capabilities, including:

  • Support for HTTP persistent connections, reducing the overhead of opening new connections for each request and decreasing latency by approximately 40% for 4KB S3 ReadObject requests
  • TLS encryption for communications between the Alluxio S3 API and the Alluxio worker, enhancing secure data transmission
  • Multipart upload (MPU) support, which splits files into multiple parts and uploads each part separately, simplifying uploads and improving throughput for large files
  • Alluxio Index Service, a new caching service that improves the performance of directory listings for directories storing hundreds of millions of files and subdirectories
  • UFS read rate limiter which allows administrators to set a rate limit to control the maximum bandwidth an individual Alluxio Worker can read from the UFS
  • Support for heterogeneous worker nodes, affording administrators greater flexibility in configuring clusters and improved opportunities to optimize resource allocation
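A UFS read rate limiter of the kind listed above is commonly built as a token bucket: reads spend a byte budget that refills at the configured rate, capping the bandwidth a worker can pull from the underlying store. The sketch below is a generic illustration of that pattern, not Alluxio's implementation; all names and parameters are assumptions.

```python
import time


class TokenBucket:
    """Illustrative token-bucket limiter: a worker may read from the UFS only
    as fast as its byte budget refills. Not Alluxio's code."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec    # sustained read bandwidth cap
        self.capacity = burst_bytes       # maximum short-term burst
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, nbytes):
        """Spend budget and return True if the read fits; else False (caller waits)."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

An administrator-facing rate limit like this keeps one worker's cold reads from saturating the shared link to the underlying file system.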

To learn more about Alluxio Enterprise AI 3.5, please visit https://www.alluxio.io/.
