Marlin: efficient int4*fp16 kernel for Ampere GPUs, AWQ checkpoint loading
@efrantar, the GPTQ author, released Marlin, an optimized int4*fp16 matrix multiplication CUDA kernel for Ampere GPUs with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when batching is used.
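For intuition only (this is not the Marlin CUDA kernel, whose memory layout is more involved), per-group symmetric quantization keeps one scale per group of input channels and int4 weights with no zero-point; dequantization is a per-group multiply that an int4*fp16 kernel fuses into the matmul. A minimal PyTorch sketch, assuming a typical group_size of 128:
:::python
import torch

# Illustrative per-group symmetric int4 quantization (NOT the Marlin CUDA kernel or
# its actual memory layout). group_size=128 is an assumption typical for GPTQ models.
group_size = 128
in_features, out_features = 512, 256
w = torch.randn(in_features, out_features, dtype=torch.float16)

# One scale per (group, output column); int4 values in [-8, 7], no zero-point,
# and no act-order (groups follow the natural input-channel order).
w_groups = w.float().reshape(in_features // group_size, group_size, out_features)
scales = (w_groups.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-4)
q = torch.clamp(torch.round(w_groups / scales), -8, 7).to(torch.int8)

# Dequantization is a per-group multiply; an int4*fp16 kernel fuses it into the matmul.
w_deq = (q.float() * scales).to(torch.float16).reshape(in_features, out_features)
print("max abs quantization error:", (w - w_deq).abs().max().item())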
This kernel can be used in AutoGPTQ by loading models with the use_marlin=True argument. This flag repacks the quantized weights, since the Marlin kernel expects a different layout; the repacked weights are then saved locally so that repacking is not needed on subsequent loads. Example:
:::python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.
A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark
Visual tables coming soon.
- add marlin kernel by @qwopqwop200 in https://github.com/AutoGPTQ/AutoGPTQ/pull/514
- updated marlin serialization by @rib-2 in https://github.com/AutoGPTQ/AutoGPTQ/pull/522
- Marlin repacking CUDA kernel by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/539
- Marlin kernel can be built against any compute capability by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/540
Ability to load AWQ checkpoints in AutoGPTQ
Note: the AWQ checkpoint repacking step is currently slow; a faster implementation is possible.
AWQ's original implementation adopted a serialization format different from the one expected by the current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the underlying computation happens to be the same. AutoGPTQ can now load AWQ checkpoints to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).
Example:
:::python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00, 1.18s/it]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.
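Under the hood, the repacking step shown in the progress bar above unpacks AWQ's int32-packed 4-bit weights and re-packs them in the layout the GPTQ kernels expect. The sketch below only illustrates packing/unpacking eight 4-bit values per int32 with a toy bit order; the real AWQ and GPTQ formats additionally differ in nibble interleaving, which is precisely what the repacking converts between:
:::python
import torch

def pack_int4(q):
    # q: integer tensor with values in [0, 15], last dimension divisible by 8.
    # Packs eight 4-bit values into one int32 (illustrative bit order only).
    q = q.to(torch.int32).reshape(*q.shape[:-1], -1, 8)
    packed = torch.zeros(q.shape[:-1], dtype=torch.int32)
    for i in range(8):
        packed |= q[..., i] << (4 * i)
    return packed

def unpack_int4(packed):
    # Recovers the eight 4-bit values from each int32.
    nibbles = [(packed >> (4 * i)) & 0xF for i in range(8)]
    return torch.stack(nibbles, dim=-1).reshape(*packed.shape[:-1], -1)

q = torch.randint(0, 16, (4, 16))
assert torch.equal(unpack_int4(pack_int4(q)), q.to(torch.int32))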
- Support inference with AWQ models by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/484
Support for Qwen2, LongLLaMA and DeciLM models
These model architectures can now be quantized with AutoGPTQ; a minimal quantization sketch follows the list below.
- Add qwen2 by @JustinLin610 in https://github.com/AutoGPTQ/AutoGPTQ/pull/519
- Change deci_lm model type to deci by @LaaZa in https://github.com/AutoGPTQ/AutoGPTQ/pull/491
- Support for LongLLaMA models. by @LaaZa in https://github.com/AutoGPTQ/AutoGPTQ/pull/442
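For reference, quantizing one of these architectures follows the usual AutoGPTQ flow. A minimal sketch (the model id and the single calibration sentence below are placeholders; use a proper calibration dataset in practice):
:::python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

# Placeholder model id; any architecture supported by AutoGPTQ works the same way.
pretrained_model_id = "Qwen/Qwen1.5-0.5B"
quantized_model_dir = "qwen1.5-0.5b-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id)
# Calibration data: a real calibration set should be used here, not a single sentence.
examples = [tokenizer("AutoGPTQ is an easy-to-use LLM quantization package.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_id, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)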
Other changes and bugfixes
- Update version & install instructions by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/485
- fix the support of Qwen by @hzhwcmhf in https://github.com/AutoGPTQ/AutoGPTQ/pull/495
- rocm6.0 compatible exllama by @seungrokj in https://github.com/AutoGPTQ/AutoGPTQ/pull/515
- Untie weights for safetensors serialization by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/536
- marlin update version 0.1.1 and fix marlin bug by @qwopqwop200 in https://github.com/AutoGPTQ/AutoGPTQ/pull/524
- Use ruff for linting by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/537
- Fix wheels build for torch==2.2.0 by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/541
- Fix repo owners in workflows by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/542
- Disable peft compatibility by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/543
- Improve README by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/544
- Add ROCm dockerfile by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/545
- Make all tests pass by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/546
- Fix cuda wheel build workflows by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/547
- Use bash in workflows by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/548
- Dissociate Windows & Linux CUDA build by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/549
- Add more guards on compute capability in Marlin kernel by @fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/550
New Contributors
- @hzhwcmhf made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/495
- @rib-2 made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/522
- @seungrokj made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/515
Full Changelog: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.6.0...v0.7.0