Name | Modified | Size | Downloads / Week
---|---|---|---
vllm-0.9.2+cu118-cp38-abi3-manylinux1_x86_64.whl | 2025-07-08 | 243.4 MB |
vllm-0.9.2+cu126-cp38-abi3-manylinux1_x86_64.whl | 2025-07-08 | 359.2 MB |
vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl | 2025-07-08 | 383.4 MB |
vllm-0.9.2.tar.gz | 2025-07-08 | 9.0 MB |
README.md | 2025-07-06 | 65.6 kB |
v0.9.2 source code.tar.gz | 2025-07-06 | 8.9 MB |
v0.9.2 source code.zip | 2025-07-06 | 10.6 MB |
Totals: 7 items | | 1.0 GB | 18
Highlights
This release contains 452 commits from 167 contributors (31 new!).
NOTE: This is the last version where the V0 engine code and features remain intact. We highly recommend migrating to the V1 engine.
Engine Core
- Priority Scheduling is now implemented in the V1 engine (#19057), along with embedding models (#16188) and Mamba2 (#19327); a usage sketch follows this list.
- Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix caching. CUDA‑graph capture now shows a live progress bar, which makes debugging easier (#20301, [#18581], [#19617], [#19501]).
- FlexAttention update – any head size, FP32 fallback (#20467, [#19754]).
- Shared `CachedRequestData` objects and cached sampler‑ID stores deliver perf enhancements (#20232, [#20291]).
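
A minimal offline sketch of priority scheduling, assuming the V1 scheduler keeps the existing `scheduling_policy="priority"` engine argument and the `priority=` parameter of `LLM.generate()` (lower value treated as more urgent); the model name is only an example:

```python
# Sketch only: assumes the V1 priority scheduler reuses the existing
# `scheduling_policy="priority"` engine arg and the `priority=` parameter of
# LLM.generate(); lower values are assumed to be scheduled earlier.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", scheduling_policy="priority")

prompts = ["Summarize the theory of relativity.", "Write a haiku about GPUs."]
params = SamplingParams(max_tokens=64)

# Under contention, the second request (priority 0) is served before the first.
for out in llm.generate(prompts, params, priority=[10, 0]):
    print(out.outputs[0].text)
```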
Model Support
- New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, [#20297]), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1V (#19331), Gemma‑3n (text‑only, [#20134]), Tarsier 2 (#19887), Qwen3 Embedding & Reranker (#19260; see the embedding sketch after this list), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
- Granite hybrid MoE configurations with shared experts are fully supported (#19652).
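
As an example of the new families, a minimal offline sketch for the Qwen3 Embedding model, assuming the standard pooling-model path (`task="embed"` plus `LLM.embed()`); the model ID is taken from Hugging Face:

```python
# Sketch only: runs one of the newly supported families (Qwen3 Embedding,
# #19260) through the existing pooling-model API; model ID from Hugging Face.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = llm.embed(["vLLM v0.9.2 adds Qwen3 Embedding support."])
vector = outputs[0].outputs.embedding  # list[float]
print(f"embedding dimension: {len(vector)}")
```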
Large‑Scale Serving & Engine Improvements
- Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, [#19790], [#19885]) A launch sketch follows this list.
- Disaggregated serving enhancements: avoid stranding blocks in the prefill instance (P) when a request is aborted in the decode instance's (D) waiting queue (#19223); the toy proxy now handles /chat/completions (#19730).
- Native xPyD P2P NCCL transport as a base case for native prefill/decode disaggregation without external dependencies (#18242, [#20246]).
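
A minimal sketch of the expert-parallel launch path that EPLB builds on, under stated assumptions: `enable_expert_parallel` is the existing EP switch, `enable_eplb` is an assumed name for the new EPLB toggle from #18343, and the MoE checkpoint is only an example:

```python
# Sketch only: `enable_expert_parallel` is the existing EP engine arg;
# `enable_eplb` is an assumed name for the new EPLB toggle from #18343.
# A MoE checkpoint (example below) is required for EP to have any effect.
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example MoE model
    tensor_parallel_size=2,
    enable_expert_parallel=True,  # shard experts across ranks
    enable_eplb=True,             # assumed EPLB toggle (#18343)
)
```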
Hardware & Performance
- NVIDIA Blackwell:
  - SM120: CUTLASS W8A8/FP8 kernels and related tuning, added to the Dockerfile (#17280, [#19566], [#20071], [#19794]).
  - SM100: block‑scaled‑group GEMM, INT8/FP8 vectorisation, DeepGEMM kernels, activation chunking for MoE, and group‑size 64 for Machete (#19757, [#19572], [#19168], [#19085], [#20290], [#20331]).
- Intel GPU (V1) backend with Flash‑Attention support (#19560).
- AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, [#19744], [#18596]).
- Split‑KV support landed in the unified Triton Attention kernel, boosting long‑context throughput (#19152).
- Full‑graph mode enabled in the ROCm AITER MLA V1 decode path (#20254); a configuration sketch follows this list.
- TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, [#20235], [#19620], [#19813], [#20048], [#20339]).
- Added a models‑and‑features support matrix for TPU (#20230).
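
Several items above (FA3/FlashMLA on NVIDIA, TritonAttention and AITER MLA on ROCm) rely on full CUDA‑graph capture. A minimal sketch, assuming `full_cuda_graph` in the compilation config is the switch wired up by those PRs and that the active attention backend supports it; the model is only an example:

```python
# Sketch only: assumes `full_cuda_graph` is the CompilationConfig switch used
# by the full-graph PRs above; it only takes effect with a backend that
# supports full-graph capture (e.g. FA3/FlashMLA, ROCm TritonAttention).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",      # example model
    compilation_config={"full_cuda_graph": True},  # capture the whole forward pass
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```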
Quantization
- Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768); a usage sketch follows this list.
- Compressed‑Tensors NVFP4 (including MoE) + emulation; FP4 emulation removed on pre‑SM100 devices in favour of a Marlin fallback (#19879, [#19990], [#19563]).
- Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorisation primitives (#19395, [#20331], [#19233]).
- Bits‑and‑Bytes 0.45+ with improved double‑quant logic and AWQ quality (#20424, [#20033], [#19431], [#20076]).
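
A minimal sketch of the calibration‑free RTN path, assuming #18768 registers the method under the quantization name "rtn" and quantizes weights on the fly at load time; the model is only an example:

```python
# Sketch only: assumes the calibration-free method from #18768 is registered
# under quantization="rtn", so weights are rounded at load time and no
# pre-quantized checkpoint is needed.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="rtn")
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=4))
print(out[0].outputs[0].text)
```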
API · CLI · Frontend
- API Server: Eliminate api_key and x_request_id headers middleware overhead (#19946)
- New OpenAI‑compatible endpoints: `/v1/audio/translations` and a revamped `/v1/audio/transcriptions` (#19615, [#20179], [#19597]); a client sketch follows this list.
- Token‑level progress bar for `LLM.beam_search` and cached template‑resolution speed‑ups (#19301, [#20065]).
- Image‑object support in `llm.chat`, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, [#17177], [#16862]).
- CLI QoL: better parsing for `-O`/`--compilation-config`, batch‑size‑sweep benchmarking, richer `--help`, faster startup (#20156, [#20516], [#20430], [#19941]).
- Metrics: deprecate metrics with the `gpu_` prefix for non‑GPU‑specific metrics (#18354); export NaNs in logits to `scheduler_stats` if output is corrupted (#18777).
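
A client sketch for the audio endpoints, using the standard `openai` client against a vLLM server assumed to be serving a Whisper‑style ASR model (for example via `vllm serve openai/whisper-large-v3`); file names and the base URL are placeholders:

```python
# Sketch only: talks to a vLLM OpenAI-compatible server assumed to be running
# an ASR model (e.g. `vllm serve openai/whisper-large-v3`); these calls hit
# /v1/audio/transcriptions and the new /v1/audio/translations endpoints.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3", file=f
    )
print(transcript.text)

# Translate non-English speech to English via the new endpoint.
with open("sample_fr.wav", "rb") as f:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3", file=f
    )
print(translation.text)
```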
Platform & Deployment
- No‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
- Security hardening – runtime (cloud)pickle imports forbidden (#18018).
- Hermetic builds and wheel slimming (FA2 8.0+PTX only) shrink supply‑chain surface (#18064, [#19336]).
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in https://github.com/vllm-project/vllm/pull/19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in https://github.com/vllm-project/vllm/pull/19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in https://github.com/vllm-project/vllm/pull/19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in https://github.com/vllm-project/vllm/pull/19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in https://github.com/vllm-project/vllm/pull/19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in https://github.com/vllm-project/vllm/pull/19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in https://github.com/vllm-project/vllm/pull/19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in https://github.com/vllm-project/vllm/pull/19451
- [CI] Disable failing GGUF model test by @mgoin in https://github.com/vllm-project/vllm/pull/19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in https://github.com/vllm-project/vllm/pull/19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in https://github.com/vllm-project/vllm/pull/19455
- Fix Typo in Documentation and Function Name by @leopardracer in https://github.com/vllm-project/vllm/pull/19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in https://github.com/vllm-project/vllm/pull/19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in https://github.com/vllm-project/vllm/pull/19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19474
- [Doc] Fix quantization link titles by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in https://github.com/vllm-project/vllm/pull/19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19471
- Add cache to cuda get_device_capability by @mgoin in https://github.com/vllm-project/vllm/pull/19436
- Fix some typo by @Ximingwang-09 in https://github.com/vllm-project/vllm/pull/19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in https://github.com/vllm-project/vllm/pull/19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in https://github.com/vllm-project/vllm/pull/19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in https://github.com/vllm-project/vllm/pull/19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in https://github.com/vllm-project/vllm/pull/19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in https://github.com/vllm-project/vllm/pull/19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in https://github.com/vllm-project/vllm/pull/19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in https://github.com/vllm-project/vllm/pull/17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in https://github.com/vllm-project/vllm/pull/19505
- [CI] change spell checker from codespell to typos by @andyxning in https://github.com/vllm-project/vllm/pull/18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in https://github.com/vllm-project/vllm/pull/19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in https://github.com/vllm-project/vllm/pull/19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in https://github.com/vllm-project/vllm/pull/19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in https://github.com/vllm-project/vllm/pull/19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in https://github.com/vllm-project/vllm/pull/19509
- Fix typo by @2niuhe in https://github.com/vllm-project/vllm/pull/19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in https://github.com/vllm-project/vllm/pull/18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in https://github.com/vllm-project/vllm/pull/19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in https://github.com/vllm-project/vllm/pull/19503
- [Quantization] Improve AWQ logic by @jeejeelee in https://github.com/vllm-project/vllm/pull/19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in https://github.com/vllm-project/vllm/pull/19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in https://github.com/vllm-project/vllm/pull/19233
- Fix TorchAOConfig skip layers by @mobicham in https://github.com/vllm-project/vllm/pull/19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in https://github.com/vllm-project/vllm/pull/16756
- [doc] Make top navigation sticky by @reidliu41 in https://github.com/vllm-project/vllm/pull/19540
- [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/18847
- [Misc] Turn MOE_DP_CHUNK_SIZE into an env var by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/19506
- [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant by @mgoin in https://github.com/vllm-project/vllm/pull/19452
- [Doc] Unify structured outputs examples by @aarnphm in https://github.com/vllm-project/vllm/pull/18196
- [V1] Resolve failed concurrent structred output requests by @russellb in https://github.com/vllm-project/vllm/pull/19565
- Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" by @kouroshHakha in https://github.com/vllm-project/vllm/pull/19378
- [BugFix] : Fix Batched DeepGemm Experts by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/19515
- [Bugfix] Fix EAGLE vocab embedding for multimodal target model by @zixi-qi in https://github.com/vllm-project/vllm/pull/19570
- [Doc] uses absolute links for structured outputs by @aarnphm in https://github.com/vllm-project/vllm/pull/19582
- [doc] fix incorrect link by @reidliu41 in https://github.com/vllm-project/vllm/pull/19586
- [Misc] Correct broken docs link by @Zerohertz in https://github.com/vllm-project/vllm/pull/19553
- [CPU] Refine default config for the CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/19539
- [Fix] bump mistral common to support magistral by @princepride in https://github.com/vllm-project/vllm/pull/19533
- [Fix] The zip function in Python 3.9 does not have the strict argument by @princepride in https://github.com/vllm-project/vllm/pull/19549
- use base version for version comparison by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/19587
- [torch.compile] reorganize the cache directory to support compiling multiple models by @youkaichao in https://github.com/vllm-project/vllm/pull/19064
- [BugFix] Honor `enable_caching` in connector-delayed kvcache load case by @njhill in https://github.com/vllm-project/vllm/pull/19435
- [Model] Fix minimax model cache & lm_head precision by @qscqesze in https://github.com/vllm-project/vllm/pull/19592
- [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` by @yewentao256 in https://github.com/vllm-project/vllm/pull/19573
- [doc][mkdocs] fix the duplicate Supported features sections in GPU docs by @reidliu41 in https://github.com/vllm-project/vllm/pull/19606
- [CUDA] Enable full cudagraph for FlashMLA by @ProExpertProg in https://github.com/vllm-project/vllm/pull/18581
- [Doc] Add troubleshooting section to k8s deployment by @annapendleton in https://github.com/vllm-project/vllm/pull/19377
- [torch.compile] Use custom ops when use_inductor=False by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19618
- Adding "AMD: Multi-step Tests" to amdproduction. by @Concurrensee in https://github.com/vllm-project/vllm/pull/19508
- [BugFix] Fix DP Coordinator incorrect debug log message by @njhill in https://github.com/vllm-project/vllm/pull/19624
- [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. by @sahelib25 in https://github.com/vllm-project/vllm/pull/18354
- [Bugfix][1/n] Fix the speculative decoding test by setting the target dtype by @houseroad in https://github.com/vllm-project/vllm/pull/19633
- [Misc] Modularize CLI Argument Parsing in Benchmark Scripts by @reidliu41 in https://github.com/vllm-project/vllm/pull/19593
- [Bugfix] Fix auto dtype casting for BatchFeature by @Isotr0py in https://github.com/vllm-project/vllm/pull/19316
- [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization by @jiahanc in https://github.com/vllm-project/vllm/pull/19500
- Only build CUTLASS MoE kernels on Hopper by @huydhn in https://github.com/vllm-project/vllm/pull/19648
- [Bugfix] Don't attempt to use triton if no driver is active by @kzawora-intel in https://github.com/vllm-project/vllm/pull/19561
- [Fix] Convert kv_transfer_config from dict to KVTransferConfig by @maobaolong in https://github.com/vllm-project/vllm/pull/19262
- [Perf] Further tunings for SM100 FP8 CUTLASS kernel by @ilmarkov in https://github.com/vllm-project/vllm/pull/19566
- [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness by @houseroad in https://github.com/vllm-project/vllm/pull/19644
- [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check by @22quinn in https://github.com/vllm-project/vllm/pull/19339
- [Benchmark] Refactor benchmark script for fp8 & int8 by @yewentao256 in https://github.com/vllm-project/vllm/pull/19627
- Enable prefix caching with full cuda graphs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19617
- [CI/Build] Fix torch nightly CI dependencies part 2 by @zou3519 in https://github.com/vllm-project/vllm/pull/19589
- [Misc] Remove duplicate multiproc method setting for CPU platform by @Isotr0py in https://github.com/vllm-project/vllm/pull/19649
- [MISC] Remove unused variableds in C++ by @houseroad in https://github.com/vllm-project/vllm/pull/19609
- [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker by @quanliu1991 in https://github.com/vllm-project/vllm/pull/18957
- [Misc][Frontend] passthrough `bad_words` by @f14-bertolotti in https://github.com/vllm-project/vllm/pull/19564
- [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/19660
- [TPU] support attention head dim smaller than 128 by @yaochengji in https://github.com/vllm-project/vllm/pull/19620
- [MISC] typo fix by @andyxning in https://github.com/vllm-project/vllm/pull/19672
- [CI] Add mteb testing for rerank models by @noooop in https://github.com/vllm-project/vllm/pull/19344
- [Docs] Move multiproc doc to v1 dir by @russellb in https://github.com/vllm-project/vllm/pull/19651
- [Kernel] GGUF MMVQ kernel for multiple input vectors by @SzymonOzog in https://github.com/vllm-project/vllm/pull/18754
- [BugFix] Don't catch BaseException when dumping execute_model errors by @njhill in https://github.com/vllm-project/vllm/pull/19626
- [DOC] Add reasoning capability to vLLM streamlit code by @Navanit-git in https://github.com/vllm-project/vllm/pull/19557
- [Feature]:Allow for Granite MoE Hybrid models with only shared experts. by @shawntan in https://github.com/vllm-project/vllm/pull/19652
- [Bugfix] Fix TP inference for Flex attention backend by @Isotr0py in https://github.com/vllm-project/vllm/pull/19657
- [MISC] bump huggingface_hub pkg to 0.33.0 by @andyxning in https://github.com/vllm-project/vllm/pull/19547
- [Bugfix] fix missing 'finish_reason': null in streaming chat by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/19662
- [Kernels] Use empty for modular MoE workspaces by @bnellnm in https://github.com/vllm-project/vllm/pull/19667
- [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) by @qscqesze in https://github.com/vllm-project/vllm/pull/19677
- [V1] Change return type on get_multimodal_embeddings() by @russellb in https://github.com/vllm-project/vllm/pull/19446
- [Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 by @dsikka in https://github.com/vllm-project/vllm/pull/19563
- [Fix] Fall back to Gloo when NCCL backend is unavailable by @conroy-cheers in https://github.com/vllm-project/vllm/pull/19641
- [doc] add project flag to gcloud TPU command by @davidxia in https://github.com/vllm-project/vllm/pull/19664
- [Wheel Size] Only build FA2 8.0+PTX by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/19336
- [Frontend] add chunking audio for > 30s audio by @nguyenhoangthuan99 in https://github.com/vllm-project/vllm/pull/19597
- [DOC] fix doc typos by @diliu0349 in https://github.com/vllm-project/vllm/pull/19600
- Fixes IMA for TP w/ flex-attention by @drisspg in https://github.com/vllm-project/vllm/pull/19712
- [Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager by @quanliu1991 in https://github.com/vllm-project/vllm/pull/19686
- [Doc] Add missing llava family multi-image examples by @Isotr0py in https://github.com/vllm-project/vllm/pull/19698
- Add a doc on how to update PyTorch version by @huydhn in https://github.com/vllm-project/vllm/pull/19705
- [Kernel] Add Split-KV Support to Unified Triton Attention Kernel by @jvlunteren in https://github.com/vllm-project/vllm/pull/19152
- [doc][mkdocs] Add edit button to documentation by @reidliu41 in https://github.com/vllm-project/vllm/pull/19637
- [doc] split "Other AI Accelerators" tabs by @davidxia in https://github.com/vllm-project/vllm/pull/19708
- [V1][Kernel] Flashinfer HND KV cache layout by @NickLucche in https://github.com/vllm-project/vllm/pull/19280
- [Mis] remove duplicate engine status checks by @googs1025 in https://github.com/vllm-project/vllm/pull/19647
- [Bugfix] Update multimodel models mapping to fit new checkpoint after Transformers v4.52 by @Isotr0py in https://github.com/vllm-project/vllm/pull/19151
- [Perf] Optimize `moe_align_block_size` CUDA kernel by @yewentao256 in https://github.com/vllm-project/vllm/pull/19572
- Remove sm120 arch from sm100 cutlass kernel arch list by @mgoin in https://github.com/vllm-project/vllm/pull/19716
- [Misc] Update lmcache connector with the latest connector apis by @YaoJiayi in https://github.com/vllm-project/vllm/pull/19441
- [Bugfix] Fix faulty triton importing logic when using Ray for DP by @mgoin in https://github.com/vllm-project/vllm/pull/19734
- [Feature][ROCm] Add full graph capture support for TritonAttentionBackend by @charlifu in https://github.com/vllm-project/vllm/pull/19158
- [TPU] Update torch version to include paged attention kernel change by @Chenyaaang in https://github.com/vllm-project/vllm/pull/19706
- [MISC] correct copy_blocks src_to_dists param type by @andyxning in https://github.com/vllm-project/vllm/pull/19696
- [MISC] correct DeviceConfig device field static type analysis by @andyxning in https://github.com/vllm-project/vllm/pull/19699
- [Misc] Add str for RequestStatus by @lk-chen in https://github.com/vllm-project/vllm/pull/19780
- [V1] Add API docs for EncoderCacheManager by @russellb in https://github.com/vllm-project/vllm/pull/19294
- [V1][P/D] An native implementation of xPyD based on P2P NCCL by @Abatom in https://github.com/vllm-project/vllm/pull/18242
- [V1] Decouple GPU and TPU `InputBatch` by @afeldman-nm in https://github.com/vllm-project/vllm/pull/19778
- [Minor] Zero-initialize attn output buffer by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19784
- [doc] fix the incorrect label by @reidliu41 in https://github.com/vllm-project/vllm/pull/19787
- [Platform] Allow platform use V1 Engine by default by @wangxiyuan in https://github.com/vllm-project/vllm/pull/19792
- [Qwen] Add tagging rule for Qwen related PRs by @houseroad in https://github.com/vllm-project/vllm/pull/19799
- [Hardware][AMD] integrate aiter chunked prefill into vllm by @Zzz9990 in https://github.com/vllm-project/vllm/pull/18596
- [Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/19725
- [Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc by @russellb in https://github.com/vllm-project/vllm/pull/19808
- [v1] Support mamba2 by @heheda12345 in https://github.com/vllm-project/vllm/pull/19327
- docs: fix Slack bulletpoint in README by @nathan-weinberg in https://github.com/vllm-project/vllm/pull/19811
- Disable "Forbid direct 'import triton'" check for
vllm/triton_utils/importing.py
in an extensible way by @afeldman-nm in https://github.com/vllm-project/vllm/pull/19783 - [Core] Do not copy array during hashing by @lgeiger in https://github.com/vllm-project/vllm/pull/19484
- [TPU] Update torch-xla version to include paged attention tuned block change by @QiliangCui in https://github.com/vllm-project/vllm/pull/19813
- [Core] More fixes to MultiModalEmbeddings type handling by @russellb in https://github.com/vllm-project/vllm/pull/19715
- [Multimodal] Use fast processor for Qwen2/2.5-VL by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19789
- [BugFix] Fix use_cudagraph=False by @zou3519 in https://github.com/vllm-project/vllm/pull/19612
- [Frontend] Expose custom args in OpenAI APIs by @afeldman-nm in https://github.com/vllm-project/vllm/pull/16862
- Fix FA2 fallback for Blackwell V1 by @mgoin in https://github.com/vllm-project/vllm/pull/19781
- [Misc][ROCm] Enforce no unused variable in ROCm C++ files by @houseroad in https://github.com/vllm-project/vllm/pull/19796
- [Quantization] Modify the logic of BNB double quantization by @jeejeelee in https://github.com/vllm-project/vllm/pull/19742
- Support embedding models in V1 by @maxdebayser in https://github.com/vllm-project/vllm/pull/16188
- [Bugfix] Fix the linter by @houseroad in https://github.com/vllm-project/vllm/pull/19826
- [Bugfix] Add check_health to v1 async client. by @kouroshHakha in https://github.com/vllm-project/vllm/pull/19821
- Mark invariant normalizer in Gemma as non-persistent by @yhtang in https://github.com/vllm-project/vllm/pull/19788
- [ROCm] [AITER] [Bugfix] Patch for AITER commit `648764942e552a8bb5fe16026703716a81f05374` by @tjtanaa in https://github.com/vllm-project/vllm/pull/18990
- [Misc] [ROCm] Prevent surplus tensor reshape by @zsolt-borbely-htec in https://github.com/vllm-project/vllm/pull/19803
- raise exception for pin_lora by @andyxning in https://github.com/vllm-project/vllm/pull/19809
- [Minor] Allow redirecting model path for HfRunner in test by @Isotr0py in https://github.com/vllm-project/vllm/pull/19795
- Add xLAM tool parser support by @zuxin666 in https://github.com/vllm-project/vllm/pull/17148
- [Frontend] Add optional token-level progress bar to `LLM.beam_search` by @NekoMimiUnagi in https://github.com/vllm-project/vllm/pull/19301
- Fixing Chunked Prefill Test. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/19762
- [Doc] Update V1 user guide for embedding models by @22quinn in https://github.com/vllm-project/vllm/pull/19842
- [CI][CPU] Improve dummy Triton interfaces and fix the CPU CI by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/19838
- [Core][Bugfix] Fix Online MM Beam Search by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/19688
- [Frontend] early return chat format resolution when specified by @xzbdmw in https://github.com/vllm-project/vllm/pull/19735
- [Benchmark][Bugfix] Fix Dataset Length Calculation by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/19868
- [CI/Build][Bugfix] Fix deadlock on v1 engine test CI by @Isotr0py in https://github.com/vllm-project/vllm/pull/19872
- [CI][Neuron] Fail and exit on first error by @elaineyz in https://github.com/vllm-project/vllm/pull/19622
- [Benchmark] Fix `Value of type "SampleRequest" is not indexable` by @b8zhong in https://github.com/vllm-project/vllm/pull/18032
- [Chore]: qwen3-moe-type-hints-mistake by @Xerxes-cn in https://github.com/vllm-project/vllm/pull/19860
- [Bugfix] Enable PP with AITER+V1 by @qli88 in https://github.com/vllm-project/vllm/pull/19822
- [Bugfix][Ray] Set the cuda context eagerly in the ray worker by @kouroshHakha in https://github.com/vllm-project/vllm/pull/19583
- [Misc] update cuda version by @reidliu41 in https://github.com/vllm-project/vllm/pull/19526
- [Misc] refactor example - openai_transcription_client by @reidliu41 in https://github.com/vllm-project/vllm/pull/19851
- [Kernel] correct cpu worker function parameter type by @andyxning in https://github.com/vllm-project/vllm/pull/19745
- [Fix] import regex instead of re by @tdoublep in https://github.com/vllm-project/vllm/pull/19875
- [Model] GPT2ForSequenceClassification model by @nie3e in https://github.com/vllm-project/vllm/pull/19663
- [custom_op][vllm-plugin] update custom_op class to use op_registry by @xuechendi in https://github.com/vllm-project/vllm/pull/19164
- Export NaNs in logits to scheduler_stats if output is corrupted by @vladmihailescu in https://github.com/vllm-project/vllm/pull/18777
- [CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/19901
- [Kernel] mark TorchSDPABackend swap_blocks NotImplementedError by @andyxning in https://github.com/vllm-project/vllm/pull/19749
- [Misc] Clean up useless code by @wangxiyuan in https://github.com/vllm-project/vllm/pull/19889
- Fix: Check the type of params to be a Sequence not list. by @rabinadk1 in https://github.com/vllm-project/vllm/pull/19910
- [Bugfix] Fix bnb 8bit model weights loading by @Isotr0py in https://github.com/vllm-project/vllm/pull/19917
- [New model support]Support Tarsier2 by @princepride in https://github.com/vllm-project/vllm/pull/19887
- [doc] add contact us in community by @reidliu41 in https://github.com/vllm-project/vllm/pull/19922
- [Multimodal] Optimize Qwen2/2.5-VL startup time by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19756
- [Docs] Add GPT2ForSequenceClassification to supported models in docs by @nie3e in https://github.com/vllm-project/vllm/pull/19932
- [Misc] add vllm_config in init by @andyxning in https://github.com/vllm-project/vllm/pull/19866
- [MISC] add cpu_kvcache_space_bytes to CacheConfig by @andyxning in https://github.com/vllm-project/vllm/pull/19812
- [Benchmark] fix request loss if "ping" is returned by @sywangyi in https://github.com/vllm-project/vllm/pull/19535
- [CI/Build] Auto tag perf benchmarks related PRs by @22quinn in https://github.com/vllm-project/vllm/pull/19943
- [doc] use snippets for contact us by @reidliu41 in https://github.com/vllm-project/vllm/pull/19944
- [Misc] Update model-specific PR tagging by @ywang96 in https://github.com/vllm-project/vllm/pull/19949
- [Misc] Simplify vllm bench cli subcommand implementation by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/19948
- [Chore] dedup logs by @aarnphm in https://github.com/vllm-project/vllm/pull/19955
- [BugFix] Add an env to disable moe chunking to work around compile incompatibility by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/19642
- [Perf][CLI] Improve overall startup time by @aarnphm in https://github.com/vllm-project/vllm/pull/19941
- [Core] feat: Implement Priority Scheduling in V1 Engine by @amitm02 in https://github.com/vllm-project/vllm/pull/19057
- [Misc] Configurable timeout for execute_model RPC calls via env var by @jinqinn in https://github.com/vllm-project/vllm/pull/19544
- Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor by @Flink-ddd in https://github.com/vllm-project/vllm/pull/19643
- [doc] Fold long code blocks to improve readability by @reidliu41 in https://github.com/vllm-project/vllm/pull/19926
- [P/D][NixlConnector] Support `tp_size > num_kv_heads` deployments by @NickLucche in https://github.com/vllm-project/vllm/pull/19691
- [BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when all transfer done by @lk-chen in https://github.com/vllm-project/vllm/pull/19874
- [Doc] Update V1 status for decoder-only embedding models by @Isotr0py in https://github.com/vllm-project/vllm/pull/19952
- [doc] use MkDocs collapsible blocks - supplement by @reidliu41 in https://github.com/vllm-project/vllm/pull/19973
- [Bugfix] Fix CI bitsandbytes failure by @jeejeelee in https://github.com/vllm-project/vllm/pull/19969
- [doc] improve readability for long commands by @reidliu41 in https://github.com/vllm-project/vllm/pull/19920
- [Docs] Fix syntax highlighting of shell commands by @lgeiger in https://github.com/vllm-project/vllm/pull/19870
- [EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/19885
- [Bugfix][v1] Fix step pooler implementation and step pooling usage in v1 by @Isotr0py in https://github.com/vllm-project/vllm/pull/19956
- [Misc] Add type alias `ReqId` and `EngineId` for better readability by @lk-chen in https://github.com/vllm-project/vllm/pull/19880
- [Feature] Support sequence parallelism for static fp8 quantization by @cascade812 in https://github.com/vllm-project/vllm/pull/19181
- [CI/Build] Push latest tag for cpu and neuron docker image by @22quinn in https://github.com/vllm-project/vllm/pull/19897
- Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend by @Jun-Howie in https://github.com/vllm-project/vllm/pull/19395
- [Bugfix][Benchmark] Fix Marlin benchmark by @22quinn in https://github.com/vllm-project/vllm/pull/19929
- [TPU] Fix tpu model runner test by @Chenyaaang in https://github.com/vllm-project/vllm/pull/19995
- Update test case parameter to have the throughput above 8.0 by @QiliangCui in https://github.com/vllm-project/vllm/pull/19994
- [Misc][Tools][Benchmark] Add profile to autotune script by @Chenyaaang in https://github.com/vllm-project/vllm/pull/19711
- [doc] Fix broken link in the installation for CPU by @yankay in https://github.com/vllm-project/vllm/pull/19980
- add some examples for other benchmark scripts by @reidliu41 in https://github.com/vllm-project/vllm/pull/19893
- [PERF] Speedup of MRoPE prepare inputs by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/19939
- [Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20014
- refactor example - qwen3_reranker by @reidliu41 in https://github.com/vllm-project/vllm/pull/19847
- [Fix][V1] Remove --scheduling-policy oracle by @amitm02 in https://github.com/vllm-project/vllm/pull/20010
- [Perf] Improve/Fix-regression for FA3 in High QPS regimes by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/19463
- [Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client. by @dtransposed in https://github.com/vllm-project/vllm/pull/19423
- [BugFix] Fix multi-node offline data parallel by @njhill in https://github.com/vllm-project/vllm/pull/19937
- [P/D] Asynchronously do _nixl_handshake by @lk-chen in https://github.com/vllm-project/vllm/pull/19836
- [Feature] Integrate new deepgemm by @yewentao256 in https://github.com/vllm-project/vllm/pull/19820
- [Easy] Remove submodule added in [#19463] by @b8zhong in https://github.com/vllm-project/vllm/pull/20039
- use .dev for version comparison with pytorch nightly release by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/20031
- cmake: Update vllm_flash_attn for vllm_kernels by @seemethere in https://github.com/vllm-project/vllm/pull/20032
- [Llama4] Update `attn_temperature_tuning` by @b8zhong in https://github.com/vllm-project/vllm/pull/19997
- Revert "[Feature] Integrate new deepgemm (#19820)" by @yewentao256 in https://github.com/vllm-project/vllm/pull/20049
- Revert "Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor" by @Isotr0py in https://github.com/vllm-project/vllm/pull/20030
- Move to a faster base64 implementation by @h-avsha in https://github.com/vllm-project/vllm/pull/19984
- [Frontend] speed up import time of vllm.config by @davidxia in https://github.com/vllm-project/vllm/pull/18036
- [Refactor] Remove duplicate `ceil_div` by @yewentao256 in https://github.com/vllm-project/vllm/pull/20023
- [Feat][CLI] enforce-include-usage by @max-wittig in https://github.com/vllm-project/vllm/pull/19695
- [Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. by @bnellnm in https://github.com/vllm-project/vllm/pull/19717
- [Chore] debloat some initial logs by @aarnphm in https://github.com/vllm-project/vllm/pull/19438
- [BugFix] Fix full-cuda-graph illegal memory access in FA3 by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/20057
- [doc] add reference link for Intel XPU by @reidliu41 in https://github.com/vllm-project/vllm/pull/20064
- [Doc] Guide for Incremental Compilation Workflow by @mgoin in https://github.com/vllm-project/vllm/pull/19109
- [V1][Speculative Decoding] Fix DeepSeek MTP by @cjackal in https://github.com/vllm-project/vllm/pull/20022
- [Frontend] Add `/v1/audio/translations` OpenAI API endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/19615
- [Quantization] Add compressed-tensors emulations support for NVFP4 by @dsikka in https://github.com/vllm-project/vllm/pull/19879
- [Fix] Support cls pooling in ModernBertPooler by @lsz05 in https://github.com/vllm-project/vllm/pull/20067
- static_scaled_fp8_quant should not run when scale.numel is not 1 by @eldarkurtic in https://github.com/vllm-project/vllm/pull/20076
- [PD] let toy proxy handle /chat/completions by @lk-chen in https://github.com/vllm-project/vllm/pull/19730
- [Misc] Add parallel state `node_count` function by @njhill in https://github.com/vllm-project/vllm/pull/20045
- Fix the path to the testing script. by @QiliangCui in https://github.com/vllm-project/vllm/pull/20082
- [Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine by @izhuhaoran in https://github.com/vllm-project/vllm/pull/20062
- [TPU][Bugfix] fix kv cache padding by @yaochengji in https://github.com/vllm-project/vllm/pull/20048
- [P/D] Avoid stranding blocks in P when aborted in D's waiting queue by @njhill in https://github.com/vllm-project/vllm/pull/19223
- [TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN by @Chenyaaang in https://github.com/vllm-project/vllm/pull/19919
- [CI] Add SM120 to the Dockerfile by @mgoin in https://github.com/vllm-project/vllm/pull/19794
- [Bugfix] Fix Mistral tool-parser regex for nested JSON by @mgoin in https://github.com/vllm-project/vllm/pull/20093
- [PD] Skip `tp_size` exchange with rank0 by @NickLucche in https://github.com/vllm-project/vllm/pull/19413
- [Benchmark][Bug] Fix multiple bugs in bench and add args to spec_decode offline by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/20083
- [Bugfix] Allow `CUDA_VISIBLE_DEVICES=''` in `Platform.device_id_to_physical_device_id` by @eicherseiji in https://github.com/vllm-project/vllm/pull/18979
- [Doc] Update docs for New Model Implementation by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20115
- [Refactor] Remove unused library by @yewentao256 in https://github.com/vllm-project/vllm/pull/20099
- [CPU] Fix torch version in x86 CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/19258
- [Misc] Use collapsible blocks for benchmark examples. by @reidliu41 in https://github.com/vllm-project/vllm/pull/20017
- [Docs] Improve frameworks/helm.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20113
- [Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) by @tjtanaa in https://github.com/vllm-project/vllm/pull/19904
- Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine" by @mgoin in https://github.com/vllm-project/vllm/pull/20128
- [Bug Fix] Fix address/port already in use error for pplx test by @yewentao256 in https://github.com/vllm-project/vllm/pull/20094
- [Doc] Automatically signed-off by PyCharm by @noooop in https://github.com/vllm-project/vllm/pull/20120
- [Doc] Auto sign-off for VSCode by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20132
- [Doc] Rename page titles by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20130
- Spam folks if config.py changes by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20131
- [Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. by @jikunshang in https://github.com/vllm-project/vllm/pull/19560
- [TPU] add kv cache update kernel by @yaochengji in https://github.com/vllm-project/vllm/pull/19928
- [Refactor] Rename commnication utils by @yewentao256 in https://github.com/vllm-project/vllm/pull/20091
- [Doc] correct LoRA capitalization by @kyolebu in https://github.com/vllm-project/vllm/pull/20135
- [Feature] Expert Parallelism Load Balancer (EPLB) by @abmfy in https://github.com/vllm-project/vllm/pull/18343
- [CI Failure] Fix OOM with test_oot_registration_embedding by @mgoin in https://github.com/vllm-project/vllm/pull/20144
- [Quantization] Bump to use latest `compressed-tensors` by @dsikka in https://github.com/vllm-project/vllm/pull/20033
- [Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler by @ilmarkov in https://github.com/vllm-project/vllm/pull/20071
- [Bugfix] Build moe_data for both sm100 and sm90 by @mgoin in https://github.com/vllm-project/vllm/pull/20086
- [Feature][Rocm] add quick all reduce for rocm by @lihaoyang-amd in https://github.com/vllm-project/vllm/pull/19744
- [CI] Sync test dependency with test.in for torch nightly by @yangw-dev in https://github.com/vllm-project/vllm/pull/19632
- [Fix] Fix gemma CI test failing on main by @tdoublep in https://github.com/vllm-project/vllm/pull/20124
- [Model][1/N] Automatic conversion of CrossEncoding model by @noooop in https://github.com/vllm-project/vllm/pull/20012
- [Perf][Frontend]: eliminate api_key and x_request_id headers middleware overhead by @Yazan-Sharaya in https://github.com/vllm-project/vllm/pull/19946
- Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn by @xuechendi in https://github.com/vllm-project/vllm/pull/20143
- Gemma3n (Text-only) by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/20134
- [Bugfix] Fix flaky failure when getting DP ports by @mgoin in https://github.com/vllm-project/vllm/pull/20151
- [Perf][Frontend] Cached resolution for resolving chat templates by @ilyal-cerebras in https://github.com/vllm-project/vllm/pull/20065
- [Fix][ROCm] Remove unused variables to fix build error on GFX11/12 by @hyoon1 in https://github.com/vllm-project/vllm/pull/19891
- [Fix][torch.compile] Enable custom ops by default when Inductor off by @ProExpertProg in https://github.com/vllm-project/vllm/pull/20102
- [Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. by @bnellnm in https://github.com/vllm-project/vllm/pull/20152
- [Bugfix] Fix some narrowing conversion warnings by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20141
- [CI/Build] Allow hermetic builds by @fabiendupont in https://github.com/vllm-project/vllm/pull/18064
- [CI Fix] Pin tests/models/registry.py MiniMaxText01ForCausalLM to revision due to model changes by @mgoin in https://github.com/vllm-project/vllm/pull/20199
- [Misc] Add type assertion of request_id for LLMEngine.add_request by @SHA-4096 in https://github.com/vllm-project/vllm/pull/19700
- Fix num_token_padding support for static per-tensor scaled_fp8_quant by @mgoin in https://github.com/vllm-project/vllm/pull/20188
- fix ci issue distributed 4 gpu test by @yewentao256 in https://github.com/vllm-project/vllm/pull/20204
- [Bugfix] Properly reject requests with empty list guided_choice by @mgoin in https://github.com/vllm-project/vllm/pull/20195
- [BugFix] Fix the incorrect func name in the comments. (config.py) by @1195343015 in https://github.com/vllm-project/vllm/pull/20185
- [CI/Build] Add new CI job to validate Hybrid Models for every PR by @tdoublep in https://github.com/vllm-project/vllm/pull/20147
- [Frontend] Generalize `v1/audio/transcriptions` endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/20179
- [Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution by @s3woz in https://github.com/vllm-project/vllm/pull/20137
- [Refactor] Create a function util and cache the results for `has_deepgemm`, `has_deepep`, `has_pplx` by @yewentao256 in https://github.com/vllm-project/vllm/pull/20187
- [CI Fix] Try fixing eagle e2e test OOM by reducing block allocation by @mgoin in https://github.com/vllm-project/vllm/pull/20213
- [Quantization] Add compressed-tensors NVFP4 MoE Support by @dsikka in https://github.com/vllm-project/vllm/pull/19990
- Fix cuda_archs_loose_intersection when handling sm_*a by @huydhn in https://github.com/vllm-project/vllm/pull/20207
- [Model] support dots1 by @redmoe-moutain in https://github.com/vllm-project/vllm/pull/18254
- [BUGFIX][DEEPSEEK][MODEL_LOAD] fix w13, w2 weight not initialized assert by @xuechendi in https://github.com/vllm-project/vllm/pull/20202
- [Misc] Fix import by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20233
- [doc] Add Slack and Forum to the top navigation by @reidliu41 in https://github.com/vllm-project/vllm/pull/20208
- [Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model by @noiji in https://github.com/vllm-project/vllm/pull/19598
- [Bugfix] Fix processor initialization in transformers 4.53.0 by @Isotr0py in https://github.com/vllm-project/vllm/pull/20244
- [Quantization] Improve BitsAndBytesModelLoader by @jeejeelee in https://github.com/vllm-project/vllm/pull/20242
- [Docs] Fix 1-2-3 list in v1/prefix_caching.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20243
- [Bugfix] fix quark ptpc by @lihaoyang-amd in https://github.com/vllm-project/vllm/pull/20251
- [Spec Decode] Refactor spec decoding into a separate function by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20238
- [Spec Decode] Clean up spec decode example by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20240
- [Optimization] Use Shared `CachedRequestData` Instance Across All Requests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20232
- [Unit Test] Add unit test for deep gemm by @yewentao256 in https://github.com/vllm-project/vllm/pull/20090
- [Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models by @kylesayrs in https://github.com/vllm-project/vllm/pull/20058
- [Refactor] Remove useless pdb comment by @yewentao256 in https://github.com/vllm-project/vllm/pull/20266
- [Bugfix][V1][P/D]Fix the issue of occasional garbled output for P2pNcclConnector by @Abatom in https://github.com/vllm-project/vllm/pull/20263
- [CLI] Improve CLI arg parsing for `-O`/`--compilation-config` by @ProExpertProg in https://github.com/vllm-project/vllm/pull/20156
- [Bugfix] Fix include prompt in stream response when echo=true by @fyuan1316 in https://github.com/vllm-project/vllm/pull/15233
- [Misc] Fix spec decode example by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20296
- [Example] add one-click runnable example for P2P NCCL XpYd by @KuntaiDu in https://github.com/vllm-project/vllm/pull/20246
- [CI][Intel Gaudi][vllm-Plugin]Add CI for hpu-plugin-v1-test by @xuechendi in https://github.com/vllm-project/vllm/pull/20196
- [Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA by @chewong in https://github.com/vllm-project/vllm/pull/15897
- [Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference by @sakogan in https://github.com/vllm-project/vllm/pull/18768
- [V1] Only print cudagraph tqdm on rank 0 with `is_global_first_rank` by @mgoin in https://github.com/vllm-project/vllm/pull/19516
- Fix `numel()` downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2 by @r-barnes in https://github.com/vllm-project/vllm/pull/17082
- [Misc] add xgrammar for arm64 by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/18359
- Enable ZP Support for Machete by @czhu-cohere in https://github.com/vllm-project/vllm/pull/20268
- [CPU] Update custom ops for the CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20255
- [Bugfix] Fix deepep tests by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20288
- [Misc] remove redundant char by @kebe7jun in https://github.com/vllm-project/vllm/pull/20287
- [BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine by @tywuAMD in https://github.com/vllm-project/vllm/pull/19067
- [doc] fix the incorrect logo in dark mode by @reidliu41 in https://github.com/vllm-project/vllm/pull/20289
- [Perf] Validate @config in pre-commit instead of dynamically by @lionelvillard in https://github.com/vllm-project/vllm/pull/20200
- [Quant] [Bugfix] Fix quantization config matching with `hf_to_vllm_mapper` by @kylesayrs in https://github.com/vllm-project/vllm/pull/20046
- [Misc] Minor refactor of NIXL background handshake by @NickLucche in https://github.com/vllm-project/vllm/pull/20068
- Add GLM4.1V model (Draft) by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/19331
- [Model]Add Tencent HunYuanMoEV1 Model Support by @aiyiwang2025 in https://github.com/vllm-project/vllm/pull/20114
- [Misc] Minor refactoring for scheduler by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20299
- [Docs] Update transcriptions API to use openai client with `stream=True` by @NickLucche in https://github.com/vllm-project/vllm/pull/20271
- [CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20301
- [Frontend] Expand tools even if tool_choice="none" by @okdshin in https://github.com/vllm-project/vllm/pull/17177
- [V1] [ROCm] Enable EP with AITER Fused MoE by @tjtanaa in https://github.com/vllm-project/vllm/pull/20270
- [Optimization] Cache sampled token ids in model runner by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20291
- remove unused variables in marlin_template.h by @zhoutianzi666 in https://github.com/vllm-project/vllm/pull/20236
- [Refactor] Refactor import utils by @yewentao256 in https://github.com/vllm-project/vllm/pull/20269
- Enable group size 64 for Machete by @czhu-cohere in https://github.com/vllm-project/vllm/pull/20290
- [Kernel][Bugfix] Fixup some warnings in nvfp4_blockwise_moe when CUDA < 12.8 by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20324
- [UT][intel GPU] use current_platform instead of device hardcode in v1 tests by @Liangliang-Ma in https://github.com/vllm-project/vllm/pull/20169
- [Refactor] Remove duplicate `find_free_port` by @yewentao256 in https://github.com/vllm-project/vllm/pull/20333
- [Refactor] Remove Unused Env `VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON` by @yewentao256 in https://github.com/vllm-project/vllm/pull/20334
- [Misc][Doc] Add missing comment for LLM by @draftbk in https://github.com/vllm-project/vllm/pull/20285
- [FIX][Intel GPU]fix ipex flash_attn_varlen_func api missing parameter by @jikunshang in https://github.com/vllm-project/vllm/pull/20348
- [Bugfix] Fix dynamic rotary embedding by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20343
- fix[Docs]: link anchor is incorrect [#20309] by @yyzxw in https://github.com/vllm-project/vllm/pull/20315
- [Doc][TPU] Add models and features supporting matrix. by @QiliangCui in https://github.com/vllm-project/vllm/pull/20230
- [TPU] kv cache update kernel supports dynamic grid by @yaochengji in https://github.com/vllm-project/vllm/pull/20235
- [Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. by @huachenheli in https://github.com/vllm-project/vllm/pull/20105
- [Model][VLM] Support Keye-VL-8B-Preview by @Kwai-Keye in https://github.com/vllm-project/vllm/pull/20126
- [Bugfix] Keye-VL compatibility with `tok_kwargs` (#20058) by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20353
- [Docs] Fix indentations for 2-level items in deprecation_policy.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20352
- [Docs] Make TPU ref prettier in google_tpu.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20356
- [Model] Add Ernie4.5 and Ernie4.5MoE Model Support by @CSWYF3634076 in https://github.com/vllm-project/vllm/pull/20220
- [Build/CI] Automatically tag DeepSeek related PRs by @houseroad in https://github.com/vllm-project/vllm/pull/20370
- [NVIDIA] Support Cutlass w8a8 FP8 for Blackwell Geforce GPUs (sm120) by @kaln27 in https://github.com/vllm-project/vllm/pull/17280
- [Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models by @huaqiangwang in https://github.com/vllm-project/vllm/pull/20322
- [Model] Adds support for SlimMoE models Phi-tiny-MoE-instruct by @zichongli5 in https://github.com/vllm-project/vllm/pull/20286
- Documentation update tool_calling: mapping back to function from response by @cronoik-inceptionai in https://github.com/vllm-project/vllm/pull/20373
- [Kernels] MoE refactor by @bnellnm in https://github.com/vllm-project/vllm/pull/19636
- [V1] LogitsProcessor programming model by @afeldman-nm in https://github.com/vllm-project/vllm/pull/16728
- [Minor] Clean up incorrect comment in test by @njhill in https://github.com/vllm-project/vllm/pull/20382
- [Misc] add handler HF_TOKEN is emptry string by @lengrongfu in https://github.com/vllm-project/vllm/pull/20369
- [ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) by @vllmellm in https://github.com/vllm-project/vllm/pull/20254
- [DP] Support external DP Load Balancer mode by @njhill in https://github.com/vllm-project/vllm/pull/19790
- [Docs] Update EAGLE example by @NickLucche in https://github.com/vllm-project/vllm/pull/20375
- [Bugfix] Fixes for FlashInfer's TORCH_CUDA_ARCH_LIST by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20136
- [BugFix] Fix DP headless mode arg validation by @njhill in https://github.com/vllm-project/vllm/pull/20398
- Enable CPU nightly performance benchmark and its Markdown report by @louie-tsai in https://github.com/vllm-project/vllm/pull/18444
- [Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py by @bnellnm in https://github.com/vllm-project/vllm/pull/20381
- [Misc] Small: Fix video loader return type annotations. by @huachenheli in https://github.com/vllm-project/vllm/pull/20389
- [Bugfix][CI/CD][CPU] Fix CPU CI tests by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20383
- [TPU] Add a case to cover RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 by @QiliangCui in https://github.com/vllm-project/vllm/pull/20385
- [Feature] Support MiniMax-M1 function calls features by @qscqesze in https://github.com/vllm-project/vllm/pull/20297
- [Tests] Update online DP tests to verify that requests are balanced by @njhill in https://github.com/vllm-project/vllm/pull/20157
- [Misc] Add rules to label Speculative Decoding Related PRs by @draftbk in https://github.com/vllm-project/vllm/pull/20406
- [doc] fix link by @reidliu41 in https://github.com/vllm-project/vllm/pull/20417
- [Docs] Replace two list with tables in intel_gaudi.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20414
- [Core] Move multimodal placeholder from chat utils to model definition by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20355
- [Kernel] refactor cpu worker v0 cache dtype by @andyxning in https://github.com/vllm-project/vllm/pull/20080
- [CI/Build][CPU] Enable cross compilation in CPU release pipeline by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20423
- [Quantization] Bump to use latest bitsandbytes by @jeejeelee in https://github.com/vllm-project/vllm/pull/20424
- [Model][2/N] Automatic conversion of CrossEncoding model by @noooop in https://github.com/vllm-project/vllm/pull/19978
- [Misc] Automatically tag PRs to add new models by @Isotr0py in https://github.com/vllm-project/vllm/pull/20222
- [Frontend] improve vllm bench <bench_type> --help display by @reidliu41 in https://github.com/vllm-project/vllm/pull/20430
- [Bugfix] Fix flaky `test_streaming_response` test by @NickLucche in https://github.com/vllm-project/vllm/pull/20363
- [Frontend] fix duplicate output for bench subcmd by @reidliu41 in https://github.com/vllm-project/vllm/pull/20446
- [CI] Trimming some failing test groups from AMDPRODUCTION. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/20390
- [Misc] Clean up InternVL family config registration by @Isotr0py in https://github.com/vllm-project/vllm/pull/19992
- [Misc] adjust for ipv6 for mookcacke url parse by @andyxning in https://github.com/vllm-project/vllm/pull/20107
- [Misc] Remove _maybe_ignore_quant_config from GLM4.1v by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/20432
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510
- Change warn_for_unimplemented_methods to debug by @mgoin in https://github.com/vllm-project/vllm/pull/20455
- [Platform] Add custom default max tokens by @gmarinho2 in https://github.com/vllm-project/vllm/pull/18557
- Add ignore consolidated file in mistral example code by @princepride in https://github.com/vllm-project/vllm/pull/20420
- [Misc] small update by @reidliu41 in https://github.com/vllm-project/vllm/pull/20462
- [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in https://github.com/vllm-project/vllm/pull/20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in https://github.com/vllm-project/vllm/pull/20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in https://github.com/vllm-project/vllm/pull/20428
- Support Llama 4 for fused_marlin_moe by @mgoin in https://github.com/vllm-project/vllm/pull/20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in https://github.com/vllm-project/vllm/pull/18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in https://github.com/vllm-project/vllm/pull/20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in https://github.com/vllm-project/vllm/pull/18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in https://github.com/vllm-project/vllm/pull/20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in https://github.com/vllm-project/vllm/pull/20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in https://github.com/vllm-project/vllm/pull/19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in https://github.com/vllm-project/vllm/pull/20502
- [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in https://github.com/vllm-project/vllm/pull/20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in https://github.com/vllm-project/vllm/pull/20508
- [doc] small fix by @reidliu41 in https://github.com/vllm-project/vllm/pull/20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in https://github.com/vllm-project/vllm/pull/20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in https://github.com/vllm-project/vllm/pull/20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in https://github.com/vllm-project/vllm/pull/19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in https://github.com/vllm-project/vllm/pull/20510
- [Misc] remove unused import by @reidliu41 in https://github.com/vllm-project/vllm/pull/20517
- test_attention compat with coming xformers change by @bottler in https://github.com/vllm-project/vllm/pull/20487
- [BUG] Fix [#20484]. Support empty sequence in cuda penalty kernel by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in https://github.com/vllm-project/vllm/pull/20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in https://github.com/vllm-project/vllm/pull/20339
- [Frontend] Support image object in llm.chat by @sfeng33 in https://github.com/vllm-project/vllm/pull/19635 (see the sketch after this list)
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in https://github.com/vllm-project/vllm/pull/20516
- [Misc] call the pre-defined func by @reidliu41 in https://github.com/vllm-project/vllm/pull/20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20527
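For readers picking up the frontend multimodal items above (e.g. https://github.com/vllm-project/vllm/pull/19635, which lets `llm.chat` accept image inputs in chat messages), the snippet below is a minimal offline sketch of multimodal chat through the Python `LLM.chat()` API. The model name and image path are placeholders, and only the long-standing `image_url`/data-URL content form is shown; the exact content key for passing an in-memory image object is not reproduced here, so treat this as an illustrative sketch rather than the PR's exact API.

```python
# Hedged sketch: multimodal chat via LLM.chat(). Model name and image path are
# placeholder assumptions; swap in any vision-language model supported by vLLM.
import base64

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")

# Encode a local image as a data URL and pass it with the OpenAI-style
# "image_url" content type, which LLM.chat() already understands.
with open("example.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }
]

outputs = llm.chat(messages)
print(outputs[0].outputs[0].text)
```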
New Contributors
- @py-andy-c made their first contribution in https://github.com/vllm-project/vllm/pull/19399
- @2niuhe made their first contribution in https://github.com/vllm-project/vllm/pull/19394
- @leopardracer made their first contribution in https://github.com/vllm-project/vllm/pull/19442
- @artetaout made their first contribution in https://github.com/vllm-project/vllm/pull/19085
- @runzhen made their first contribution in https://github.com/vllm-project/vllm/pull/19453
- @strutive07 made their first contribution in https://github.com/vllm-project/vllm/pull/19522
- @yewentao256 made their first contribution in https://github.com/vllm-project/vllm/pull/19233
- @mobicham made their first contribution in https://github.com/vllm-project/vllm/pull/19265
- @kouroshHakha made their first contribution in https://github.com/vllm-project/vllm/pull/19378
- @BoyuanFeng made their first contribution in https://github.com/vllm-project/vllm/pull/19587
- @sahelib25 made their first contribution in https://github.com/vllm-project/vllm/pull/18354
- @jiahanc made their first contribution in https://github.com/vllm-project/vllm/pull/19500
- @quanliu1991 made their first contribution in https://github.com/vllm-project/vllm/pull/18957
- @f14-bertolotti made their first contribution in https://github.com/vllm-project/vllm/pull/19564
- @Navanit-git made their first contribution in https://github.com/vllm-project/vllm/pull/19557
- @nguyenhoangthuan99 made their first contribution in https://github.com/vllm-project/vllm/pull/19597
- @diliu0349 made their first contribution in https://github.com/vllm-project/vllm/pull/19600
- @Zzz9990 made their first contribution in https://github.com/vllm-project/vllm/pull/18596
- @yhtang made their first contribution in https://github.com/vllm-project/vllm/pull/19788
- @zsolt-borbely-htec made their first contribution in https://github.com/vllm-project/vllm/pull/19803
- @zuxin666 made their first contribution in https://github.com/vllm-project/vllm/pull/17148
- @NekoMimiUnagi made their first contribution in https://github.com/vllm-project/vllm/pull/19301
- @xzbdmw made their first contribution in https://github.com/vllm-project/vllm/pull/19735
- @Xerxes-cn made their first contribution in https://github.com/vllm-project/vllm/pull/19860
- @nie3e made their first contribution in https://github.com/vllm-project/vllm/pull/19663
- @vladmihailescu made their first contribution in https://github.com/vllm-project/vllm/pull/18777
- @rabinadk1 made their first contribution in https://github.com/vllm-project/vllm/pull/19910
- @amitm02 made their first contribution in https://github.com/vllm-project/vllm/pull/19057
- @jinqinn made their first contribution in https://github.com/vllm-project/vllm/pull/19544
- @Flink-ddd made their first contribution in https://github.com/vllm-project/vllm/pull/19643
- @Jun-Howie made their first contribution in https://github.com/vllm-project/vllm/pull/19395
- @seemethere made their first contribution in https://github.com/vllm-project/vllm/pull/20032
- @h-avsha made their first contribution in https://github.com/vllm-project/vllm/pull/19984
- @max-wittig made their first contribution in https://github.com/vllm-project/vllm/pull/19695
- @lsz05 made their first contribution in https://github.com/vllm-project/vllm/pull/20067
- @kyolebu made their first contribution in https://github.com/vllm-project/vllm/pull/20135
- @lihaoyang-amd made their first contribution in https://github.com/vllm-project/vllm/pull/19744
- @Yazan-Sharaya made their first contribution in https://github.com/vllm-project/vllm/pull/19946
- @ilyal-cerebras made their first contribution in https://github.com/vllm-project/vllm/pull/20065
- @fabiendupont made their first contribution in https://github.com/vllm-project/vllm/pull/18064
- @SHA-4096 made their first contribution in https://github.com/vllm-project/vllm/pull/19700
- @1195343015 made their first contribution in https://github.com/vllm-project/vllm/pull/20185
- @redmoe-moutain made their first contribution in https://github.com/vllm-project/vllm/pull/18254
- @noiji made their first contribution in https://github.com/vllm-project/vllm/pull/19598
- @chewong made their first contribution in https://github.com/vllm-project/vllm/pull/15897
- @sakogan made their first contribution in https://github.com/vllm-project/vllm/pull/18768
- @czhu-cohere made their first contribution in https://github.com/vllm-project/vllm/pull/20268
- @aiyiwang2025 made their first contribution in https://github.com/vllm-project/vllm/pull/20114
- @okdshin made their first contribution in https://github.com/vllm-project/vllm/pull/17177
- @zhoutianzi666 made their first contribution in https://github.com/vllm-project/vllm/pull/20236
- @yyzxw made their first contribution in https://github.com/vllm-project/vllm/pull/20315
- @Kwai-Keye made their first contribution in https://github.com/vllm-project/vllm/pull/20126
- @CSWYF3634076 made their first contribution in https://github.com/vllm-project/vllm/pull/20220
- @kaln27 made their first contribution in https://github.com/vllm-project/vllm/pull/17280
- @huaqiangwang made their first contribution in https://github.com/vllm-project/vllm/pull/20322
- @zichongli5 made their first contribution in https://github.com/vllm-project/vllm/pull/20286
- @cronoik-inceptionai made their first contribution in https://github.com/vllm-project/vllm/pull/20373
- @sangbumlikeagod made their first contribution in https://github.com/vllm-project/vllm/pull/18809
- @djmmoss made their first contribution in https://github.com/vllm-project/vllm/pull/19757
- @GuyStone made their first contribution in https://github.com/vllm-project/vllm/pull/20497
- @bottler made their first contribution in https://github.com/vllm-project/vllm/pull/20487
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.9.1...v0.9.2