ERNIE-4.5-300B-A47B-2Bits-Paddle is a 2-bit quantized variant of Baidu's 300B-parameter Mixture-of-Experts (MoE) language model, designed for ultra-low-resource inference. Quantization compresses the weights rather than the architecture: the MoE router still activates roughly 47 billion parameters per token, and the model supports high-quality text generation in both English and Chinese. Built with PaddlePaddle, it combines 2-bit weight-only quantization (WINT2) with expert-parallel collaboration and load balancing, allowing the full model to be deployed on a single 141GB GPU while keeping quality effectively lossless relative to the full-precision model. The model supports a context length of up to 131,072 tokens and integrates with FastDeploy for quick service setup. Like other ERNIE 4.5 models, it benefits from large-scale pretraining followed by modality-specific post-training with SFT, DPO, and UPO. It is especially suited to applications that need high throughput and low latency on limited hardware; the recommended sampling settings are temperature 0.8 and top-p 0.8.
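As a concrete starting point, below is a minimal serving sketch. It assumes FastDeploy exposes an OpenAI-compatible API server module named `fastdeploy.entrypoints.openai.api_server`; the port and serving flags shown are illustrative placeholders and should be checked against the FastDeploy documentation for your version.

```python
# A deployment sketch, assuming FastDeploy's OpenAI-compatible API server
# module; the port and flags below are illustrative, not an authoritative
# configuration.
import subprocess

subprocess.run([
    "python", "-m", "fastdeploy.entrypoints.openai.api_server",
    "--model", "baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle",
    "--port", "8180",             # HTTP port for the OpenAI-compatible API
    "--max-model-len", "131072",  # up to the advertised context window
    "--max-num-seqs", "32",       # concurrent sequences; tune to your GPU
])
```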
## Features
- 2-bit quantized weights for minimal memory usage
- 300B total parameters with 47B active per token
- Supports deployment on a single 141GB GPU
- Long context window of up to 131,072 tokens
- Expert parallelism and load balancing for scalable performance
- Multilingual text generation (English and Chinese)
- Integrated with FastDeploy for quick inference setup (see the client sketch after this list)
- Open-source under Apache 2.0 license
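Once the server is running, any OpenAI-compatible client can query it. The sketch below uses the `openai` Python package with the recommended sampling settings; the base URL and port are placeholders matching the serving sketch above.

```python
# A minimal client sketch against the OpenAI-compatible endpoint started
# above; the base URL, port, and API key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8180/v1",
    api_key="EMPTY",  # locally hosted servers typically ignore the key
)

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle",
    messages=[{"role": "user", "content": "用一句话介绍文心大模型。"}],
    temperature=0.8,  # recommended sampling settings for this model
    top_p=0.8,
)
print(response.choices[0].message.content)
```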