Axolotl v0.11.0

🚨 Breaking Changes

Upstream Patches for CCE, Phi3, Phi4

Our Cut-Cross-Entropy (CCE) patches have been moved to a dedicated upstream fork. This improves maintainability, makes it easier for the community to contribute new patches, and allows the patches to be re-used across projects. This update includes:

  • Updates to support transformers>=4.52.4.
  • New patches for phi3 and phi4_multimodal.
  • All patches have been sanity-tested for reliability.

Please make sure to install from our fork instead. We recommend using the provided script in the repo:

:::bash
python scripts/cutcrossentropy_install.py | sh
  • Contributed by @NanoCode012 in #2813.
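
After installing the fork, CCE still needs to be enabled in your training config. A minimal sketch, assuming the plugin path and cut_cross_entropy key used in recent Axolotl docs (double-check against the current documentation):

:::yaml
# sketch: enable the Cut-Cross-Entropy integration in an Axolotl config
# the plugin path and key below are assumptions based on recent Axolotl docs
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true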

Support for PyTorch 2.5.1 is dropped as PyTorch 2.8.0 is slated to be released later this month. We recommend using torch==2.7.0 or 2.7.1.

Docker images now default to torch 2.7.1 when using main-latest tags.

vLLM is no longer included in Docker images for torch==2.6.0, because the vLLM wheels use the incorrect ABI for 2.6.0 and the last vLLM version to support torch 2.6.0 is 0.8.5.post1. See https://github.com/vllm-project/vllm/issues/13608 for more details. Similarly, vLLM is only included in the torch==2.7.0 images, as it is pinned to that particular version and 2.7.1 support is still in review.

🎉 New features

Added Chunked cross entropy loss

We've introduced chunked_cross_entropy as an alternative to the default trainer loss function. This can help reduce peak memory usage during training, especially for models with large vocabularies.
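
A minimal sketch of how this might look in a config, assuming chunked_cross_entropy is a top-level boolean flag (check the docs for the exact option name and any chunk-size settings):

:::yaml
# sketch: opt into the chunked cross entropy loss instead of the default
chunked_cross_entropy: true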

  • Contributed by @winglian in #2625.

Added Support for Falcon-H1

You can now fine-tune models from the Falcon-H1 family. Run one of the example configs.

  • Contributed by @younesbelkada in #2811.

Added Support for Devstral Small 2505

It is now possible to fine-tune Devstral models in Axolotl. Give it a try by following our docs.

  • Contributed by @NanoCode012 in #2880.

TiledMLP support

TiledMLP, introduced in Arctic Long Sequence Training, reduces the activation memory footprint of long sequences in the MLP modules.

This currently only works with DeepSpeed ZeRO-1 through ZeRO-3; single GPU, DDP, and FSDP are not yet supported. Enable it via tiled_mlp: true. Follow the linked PR for more info.
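
A minimal sketch of enabling TiledMLP alongside a DeepSpeed ZeRO-3 setup; the deepspeed path below is an assumption based on the configs shipped in the repo:

:::yaml
# sketch: TiledMLP requires a DeepSpeed ZeRO-1/2/3 setup
tiled_mlp: true
deepspeed: deepspeed_configs/zero3.json  # assumed path; any ZeRO-1..3 config should work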

  • Contributed by @winglian in #2811.

DenseMixer integration

DenseMixer is a MoE post-training method that improves router gradient estimation during MoE training. Read our docs to learn more.
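
As a rough illustration, integrations like this are usually switched on through Axolotl's plugins list; the dotted plugin path below is hypothetical, so consult the docs for the real one:

:::yaml
# hypothetical sketch: enabling an integration via the plugins list
plugins:
  - axolotl.integrations.densemixer.DenseMixerPlugin  # hypothetical path, see docs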

  • Contributed by @winglian in #2868.

Flexible Evaluation Sequence Length

You can now set a different eval_sequence_len in your config. This allows you to train with one sequence length but run evaluations on a longer or shorter one, providing more flexibility for testing model capabilities.
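
For example, a sketch of training at 4k tokens while evaluating at 8k (values are illustrative):

:::yaml
# sketch: train on shorter sequences, evaluate on longer ones
sequence_len: 4096
eval_sequence_len: 8192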

  • Contributed by @winglian in #2836.

Improved Merge LoRA on CPU for DPO

The --lora-on-cpu flag now correctly moves LoRA adapters to CPU, even for DPO. This is useful for saving VRAM when merging LoRA adapters on machines with limited GPU memory.

  • Contributed by @kallewoof in #2766.

Other Feature Enhancements

  • Log Configuration on Startup: Axolotl now logs the full, resolved configuration at the start of every run, making it much easier to verify your settings. (by @djsaunde in #2819)
  • chat_template kwargs: Restored the ability to pass additional arguments to your chat templates for more flexible formatting. (by @NanoCode012 in #2837)
  • Jinja2 Template Paths: chat_template_jinja now accepts a path to a Jinja2 template file, and existing string templates have been re-formatted into files (see the sketch after this list). (by @winglian in #2795)
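
A minimal sketch of pointing chat_template_jinja at a template file; the file path below is an illustrative placeholder:

:::yaml
# sketch: use a custom Jinja2 chat template loaded from a file
chat_template: jinja
chat_template_jinja: ./templates/my_chat_template.jinja  # illustrative path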

📦 Dependency Updates

  • flash-attn upgraded to 2.8.0.post2. (by @winglian in #2828)
  • accelerate upgraded to 1.8.1 and bitsandbytes to 0.46.0. (by @winglian in #2815)
  • mistral-common upgraded to 1.6.3 to fix multiprocessing pickling issues. (by @NanoCode012 in #2790)
  • transformers upgraded to 4.53.1. (by @winglian in #2844)

🔧 Major fixes

Reduced Startup Time for Sample Packing

A regression in a prior PR caused the trainer to start too many new processes when sample packing was enabled, which slowed startup. This has now been fixed.

  • Contributed by @winglian in #2830.

Distributed Training Fixes

  • DeepSpeed Initialization: Resolved an issue where DeepSpeed would fail to initialize correctly after a recent refactor. (by @djsaunde in #2820)
  • Sequence Parallelism VRAM: Addressed a high VRAM usage issue when using Sequence Parallelism with RL trainers. (by @djsaunde in #2829)
  • FSDP / Device Mesh: Ensured that device mesh patching is correctly applied for FSDP training. (by @djsaunde in #2842)

Iterable Dataset Fixes

Resolved critical bugs affecting iterable datasets, improving their stability and usability:

  • Fixed pickling errors that would prevent training from resuming. (by @winglian in #2831)
  • Fixed a failure during preprocessing when sampling from an iterable dataset. (by @winglian in #2825)

Tokenization Stall Fixes

Addressed a tokenization stall with single long datasets that could result in tokenization taking hours. (by @NanoCode012 in #2845)

General Stability Fixes

  • Gemma3: Re-added the Gemma3 loss patch that was inadvertently removed, fixing training for these models. (#2817 by @NanoCode012)
  • Train Sampler: Fixed a 'NoneType' object has no attribute 'column_names' error that could occur with the train data sampler. (#2822 by @NanoCode012)
  • Packing: Added an assertion to the packing patch to prevent silent failures. (by @winglian in #2840)

Other Improvements

New Contributors

Full Changelog: https://github.com/axolotl-ai-cloud/axolotl/compare/v0.10.0...v0.11.0
