A library for accelerating Transformer models on NVIDIA GPUs
LM Studio Apple MLX engine
A real-time inference engine for temporal logic specifications
High-performance reactive message-passing based Bayesian engine
A high-throughput and memory-efficient inference and serving engine for LLMs
Jlama is a modern LLM inference engine for Java
A high-performance inference engine for AI models
A 950-line, minimal, extensible LLM inference engine built from scratch
A lightweight, standalone C++ inference engine for Google's Gemma models
Alibaba's high-performance LLM inference engine for diverse applications
A lightweight vLLM implementation built from scratch
High-performance inference framework for large language models
Code for running inference and finetuning with the SAM 3 model
Mooncake is the serving platform for Kimi
Pruna is a model optimization framework built for developers
Fast Multimodal LLM on Mobile Devices
RGBD video generation model conditioned on camera input
Offline inference engine for art and real-time voice conversations
Run GGUF models easily with a KoboldAI UI
Fast, flexible LLM inference
WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
Inference Llama 2 in one file of pure C
Run a 1-billion parameter LLM on a $10 board with 256MB RAM
Parallax is a distributed model serving framework
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework