Files for v0.9.2:

Name                                               Modified     Size
vllm-0.9.2+cu118-cp38-abi3-manylinux1_x86_64.whl   2025-07-08   243.4 MB
vllm-0.9.2+cu126-cp38-abi3-manylinux1_x86_64.whl   2025-07-08   359.2 MB
vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl         2025-07-08   383.4 MB
vllm-0.9.2.tar.gz                                  2025-07-08   9.0 MB
README.md                                          2025-07-06   65.6 kB
v0.9.2 source code.tar.gz                          2025-07-06   8.9 MB
v0.9.2 source code.zip                             2025-07-06   10.6 MB

Totals: 7 items, 1.0 GB, 18 downloads/week

Highlights

This release contains 452 commits from 167 contributors (31 new!)

NOTE: This is the last version where V0 engine code and features stay intact. We highly recommend migrating to the V1 engine.
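
For code that still depends on V0, the engine choice can be pinned explicitly through the VLLM_USE_V1 environment variable before vLLM is imported; a minimal sketch (model name is a placeholder):

```python
import os

# Opt in to the V1 engine explicitly; setting "0" would fall back to V0,
# which this release still ships intact.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any supported model
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```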

Engine Core

  • Priority Scheduling is now implemented in the V1 engine (#19057; usage sketched after this list), along with embedding models in V1 (#16188) and Mamba2 in V1 (#19327).
  • Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA‑graph capture now shows a live progress bar, which makes debugging easier (#20301, [#18581], [#19617], [#19501]).
  • FlexAttention update – any head size, FP32 fallback (#20467, [#19754]).
  • Shared CachedRequestData objects and cached sampler‑ID stores deliver perf enhancements (#20232, [#20291]).
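
A minimal sketch of priority scheduling, assuming the scheduling_policy engine argument and the per-request priority argument to generate(); lower priority values are served first under this policy:

```python
from vllm import LLM, SamplingParams

# Assumed API: scheduling_policy="priority" selects the priority scheduler.
llm = LLM(model="facebook/opt-125m", scheduling_policy="priority")

prompts = ["Summarize the news.", "Urgent: translate this sentence."]
params = SamplingParams(max_tokens=32)

# One priority per prompt; the second request (priority 0) jumps the queue.
outputs = llm.generate(prompts, params, priority=[10, 0])
for out in outputs:
    print(out.outputs[0].text)
```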

Model Support

  • New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, [#20297]), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1V (#19331), Gemma‑3 (text‑only, [#20134]), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260), dots1 (#18254), GPT‑2 for Sequence Classification (#19663). An offline embedding example is sketched after this list.
  • Granite hybrid MoE configurations with shared experts are fully supported (#19652).
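
As a usage example for the new families, a sketch of offline embedding with Qwen 3 Embedding; it assumes the task="embed" pooling mode and the LLM.embed() API:

```python
from vllm import LLM

# Load the model in embedding (pooling) mode rather than generation mode.
llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = llm.embed(["What is the capital of France?"])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```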

Large‑Scale Serving & Engine Improvements

  • Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, [#19790], [#19885]). The core idea is sketched after this list.
  • Disaggregated serving enhancements: avoid stranding KV blocks in the prefill instance (P) when a request is aborted in the decode instance's (D) waiting queue (#19223); the toy proxy now handles /chat/completions (#19730).
  • Native xPyD P2P NCCL transport as a base case for native PD without external dependency (#18242, [#20246]).
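
EPLB aims to keep GPU load even when MoE expert popularity is skewed. The toy sketch below illustrates only the core intuition, not vLLM's implementation: place the hottest experts first, each onto the currently least-loaded rank.

```python
import heapq

def balance_experts(expert_load: list[float], num_ranks: int) -> list[list[int]]:
    """Greedy expert placement: hottest experts first, always onto the
    rank with the smallest accumulated load so far."""
    heap = [(0.0, r) for r in range(num_ranks)]      # (accumulated_load, rank)
    placement = [[] for _ in range(num_ranks)]
    for expert in sorted(range(len(expert_load)), key=lambda e: -expert_load[e]):
        load, rank = heapq.heappop(heap)
        placement[rank].append(expert)
        heapq.heappush(heap, (load + expert_load[expert], rank))
    return placement

# Example: 8 experts with skewed load, spread across 2 ranks.
print(balance_experts([9, 1, 1, 1, 8, 2, 3, 7], num_ranks=2))
```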

Hardware & Performance

  • NVIDIA Blackwell
    • SM120: CUTLASS W8A8/FP8 kernels and related tuning, now included in the Dockerfile (#17280, [#19566], [#20071], [#19794]).
    • SM100: block‑scaled‑group GEMM, INT8/FP8 vectorisation, deep‑GEMM kernels, activation‑chunking for MoE, and group‑size 64 for Machete (#19757, [#19572], [#19168], [#19085], [#20290], [#20331]).
  • Intel GPU (V1) backend with Flash‑Attention support (#19560).
  • AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, [#19744], [#18596]).
    • Split‑KV support landed in the unified Triton Attention kernel, boosting long‑context throughput (#19152); the technique is sketched after this list.
    • Full‑graph mode enabled in ROCm AITER MLA V1 decode path (#20254).
  • TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, [#20235], [#19620], [#19813], [#20048], [#20339]).
    • Added a supported models and features matrix (#20230).
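
The split-KV technique referenced above partitions the key/value sequence into chunks, attends to each chunk independently, and then merges the partial softmax statistics exactly; a NumPy sketch of the math (illustrative only, not the Triton kernel):

```python
import numpy as np

def attend_chunk(q, k, v):
    # Attend one query to one KV chunk; return the un-normalized
    # accumulator plus the softmax statistics (max, denominator) so the
    # partial results can be merged exactly later.
    s = k @ q                      # (chunk_len,) attention scores
    m = s.max()
    p = np.exp(s - m)
    return p @ v, m, p.sum()

def split_kv_attention(q, k, v, num_splits):
    # Each chunk is processed independently (in parallel on a GPU),
    # then the partial softmax statistics are combined.
    parts = [attend_chunk(q, kc, vc)
             for kc, vc in zip(np.array_split(k, num_splits),
                               np.array_split(v, num_splits))]
    m = max(p[1] for p in parts)                     # global max for stability
    num = sum(acc * np.exp(mi - m) for acc, mi, _ in parts)
    den = sum(s * np.exp(mi - m) for _, mi, s in parts)
    return num / den

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((1024, 64))
v = rng.standard_normal((1024, 64))

scores = k @ q
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ v
assert np.allclose(split_kv_attention(q, k, v, num_splits=8), ref)
```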

Quantization

  • Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768); see the sketch after this list.
  • Compressed‑Tensor NVFP4 (including MoE) + emulation; FP4 emulation removed on < SM100 devices (#19879, [#19990], [#19563]).
  • Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorisation primitives (#19395, [#20331], [#19233]).
  • Bits‑and‑Bytes 0.45+ support, with improved double‑quant logic and AWQ quality (#20424, [#20033], [#19431], [#20076]).
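
RTN (round-to-nearest) needs no calibration data: weights are simply rounded under a per-channel scale. A conceptual sketch (not vLLM's actual pipeline):

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4):
    """Symmetric per-output-channel round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for INT4
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def rtn_dequantize(q, scale):
    return q.float() * scale

w = torch.randn(128, 256)
q, s = rtn_quantize(w, bits=4)
err = (rtn_dequantize(q, s) - w).abs().mean()
print(f"mean abs error: {err:.4f}")
```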

API · CLI · Frontend

  • API Server: eliminated middleware overhead for the api_key and x_request_id headers (#19946).
  • New OpenAI‑compatible endpoints: /v1/audio/translations & a revamped /v1/audio/transcriptions (#19615, [#20179], [#19597]); client usage is sketched after this list.
  • Token‑level progress bar for LLM.beam_search and cached template‑resolution speed‑ups (#19301, [#20065]).
  • Image‑object support in llm.chat, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, [#17177], [#16862]).
  • CLI QoL: better parsing for -O/--compilation-config, batch‑size‑sweep benchmarking, richer --help, faster startup (#20156, [#20516], [#20430], [#19941]).
  • Metrics: deprecated the gpu_ prefix for non‑GPU‑specific metrics (#18354); NaNs in logits are now exported to scheduler_stats when output is corrupted (#18777).
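
Because the endpoints are OpenAI-compatible, the stock openai client can target them. A sketch assuming a Whisper-style model is being served (server URL, model name, and audio file are placeholders):

```python
from openai import OpenAI

# Point the client at a local vLLM server, e.g. one launched with:
#   vllm serve openai/whisper-large-v3
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(result.text)
```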

Platform & Deployment

  • Non‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
  • Security hardening – runtime (cloud)pickle imports forbidden (#18018).
  • Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink the supply‑chain surface (#18064, [#19336]).

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.9.1...v0.9.2

Source: README.md, updated 2025-07-06