LMCache
LMCache is an open source Knowledge Delivery Network (KDN) designed as a caching layer for large language model serving that accelerates inference by reusing KV (key-value) caches across repeated or overlapping computations. It enables fast prompt caching, allowing LLMs to “prefill” recurring text only once and then reuse those stored KV caches, even in non-prefix positions, across multiple serving instances. This approach reduces time to first token, saves GPU cycles, and increases throughput in scenarios such as multi-round question answering or retrieval augmented generation. LMCache supports KV cache offloading (moving cache from GPU to CPU or disk), cache sharing across instances, and disaggregated prefill, which separates the prefill and decoding phases for resource efficiency. It is compatible with inference engines like vLLM and TGI and supports compressed storage, blending techniques to merge caches, and multiple backend storage options.
Learn more
Luminal
Luminal is a machine-learning framework built for speed, simplicity, and composability, focusing on static graphs and compiler-based optimization to deliver high performance even for complex neural networks. It compiles models into minimal “primops” (only 12 primitive operations) and then applies compiler passes to replace those with device-specific optimized kernels, enabling efficient execution on GPU or other backends. It supports modules (building blocks of networks with a standard forward API) and the GraphTensor interface (typed tensors and graphs at compile time) for model definition and execution. Luminal’s core remains intentionally small and hackable, with extensibility via external compilers for datatypes, devices, training, quantization, and more. Quick-start guidance shows how to clone the repo, build a “Hello World” example, or run a larger model like LLaMA 3 using GPU features.
Learn more
CoreWeave
CoreWeave is a cloud infrastructure provider specializing in GPU-based compute solutions tailored for AI workloads. The platform offers scalable, high-performance GPU clusters that optimize the training and inference of AI models, making it ideal for industries like machine learning, visual effects (VFX), and high-performance computing (HPC). CoreWeave provides flexible storage, networking, and managed services to support AI-driven businesses, with a focus on reliability, cost efficiency, and enterprise-grade security. The platform is used by AI labs, research organizations, and businesses to accelerate their AI innovations.
Learn more
Amazon SageMaker Model Deployment
Amazon SageMaker makes it easy to deploy ML models to make predictions (also known as inference) at the best price-performance for any use case. It provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. It is a fully managed service and integrates with MLOps tools, so you can scale your model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden. From low latency (a few milliseconds) and high throughput (hundreds of thousands of requests per second) to long-running inference for use cases such as natural language processing and computer vision, you can use Amazon SageMaker for all your inference needs.
Learn more