
Marlin efficient int4*fp16 kernel on Ampere GPUs, AWQ checkpoint loading

@efrantar, GPTQ author, released Marlin, an optimized CUDA kernel for Ampere GPUs for int4*fp16 matrix multiplication with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when batching is used.
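
Marlin fuses the int4 weight dequantization with the fp16 matrix multiplication on the GPU. As a rough illustration of the numerical scheme it accelerates (per-group symmetric int4 weights with one fp16 scale per group and no zero-point), here is a plain PyTorch fake-quantization sketch; it is not the Marlin kernel itself, and the group size of 128 is simply a common default:

:::python
import torch

def fake_quantize_sym_int4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    # Per-group symmetric int4 quantize/dequantize (illustrative only, not the Marlin kernel).
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    # Symmetric quantization: one scale per group, no zero-point.
    scales = w_groups.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w_groups / scales), -8, 7)
    return (q * scales).reshape(out_features, in_features)

w = torch.randn(4096, 4096)
w_q = fake_quantize_sym_int4(w)
print((w - w_q).abs().max())  # error introduced by the int4 weight representation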

This kernel can be used in AutoGPTQ by loading models with the use_marlin=True argument. With this flag, the quantized weights are repacked, as the Marlin kernel expects a different layout; the repacked weights are then saved locally to avoid repacking on subsequent loads. Example:

:::python
import torch

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))

# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
# 
# <s> Is quantization a good compression technique?
# 
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.

A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark

Visual tables coming soon.

Ability to load AWQ checkpoints in AutoGPTQ

Note: the AWQ checkpoint repacking step is currently slow; a faster implementation is possible.

AWQ's original implementation adopted a serialization format different from the one expected by the current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the underlying computation happens to be the same. AutoGPTQ can now load AWQ checkpoints to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR, notably at sequence_length = 1 and for long sequences).

Example:

:::python
import torch

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))

# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00,  1.18s/it]
# 
# <s> Is quantization a good compression technique?
# 
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.

Support for Qwen2, LongLLaMA and Deci_lm models

These models can be quantized with AutoGPTQ.
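
Quantizing one of these models follows the same workflow as any other supported architecture. Below is a minimal sketch using AutoGPTQ's from_pretrained/quantize/save_quantized API; the model id, output directory, and the single calibration sentence are placeholders (a real run should use a much larger calibration set):

:::python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_id = "Qwen/Qwen1.5-0.5B"      # placeholder Qwen2-architecture checkpoint
quantized_model_dir = "qwen1.5-0.5b-gptq-4bit"  # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id)

# Calibration examples: a list of dicts with "input_ids" and "attention_mask".
examples = [tokenizer("AutoGPTQ is an easy-to-use LLM quantization package based on the GPTQ algorithm.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to int4
    group_size=128,  # per-group quantization
    desc_act=False,  # disable act-order
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_id, quantize_config)
model.quantize(examples)

model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)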

Other changes and bugfixes

New Contributors

Full Changelog: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.6.0...v0.7.0
