ERNIE-4.5-VL-28B-A3B-Base-PT is a large-scale multimodal Mixture-of-Experts (MoE) model developed by Baidu, with 28 billion total parameters of which about 3 billion are activated per token. It is pretrained on both text and image inputs, enabling image-to-text and conversational AI tasks. Training proceeds in stages: text-only pretraining first, followed by integration of the vision components (a ViT encoder, adapters, and dedicated visual experts) for robust cross-modal understanding. A heterogeneous MoE design, combined with advanced routing techniques and token-balancing strategies, maintains high efficiency and minimizes interference between modalities.

The model is built on PaddlePaddle and incorporates efficiency techniques such as intra-node expert parallelism, FP8 mixed precision, and 2-bit/4-bit quantization for efficient inference. In the repository name, "Base" indicates the pretrained checkpoint intended for further fine-tuning on downstream multimodal tasks, and the "PT" suffix denotes weights packaged for PyTorch-based inference. The model supports English and Chinese and is released under the Apache 2.0 license.
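The routing idea described above can be sketched in a few lines: each token is scored by a router, only the top-k experts of the token's own modality are activated, and their outputs are combined by renormalized gate weights. This is a minimal illustration, not Baidu's implementation; the expert counts, router, and top-k value here are toy assumptions.

```python
# Minimal top-k MoE routing sketch with separate text and vision expert
# pools (illustrative only; not ERNIE's actual configuration or code).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, top_k=2):
    """Pick the top_k experts by router probability and renormalize gates."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, modality, router, experts, top_k=2):
    """Dispatch one token only to the experts of its own modality."""
    logits = router[modality](token)
    out = 0.0
    for idx, gate in route_token(logits, top_k):
        out += gate * experts[modality][idx](token)
    return out

# Toy setup: 4 text experts and 4 vision experts, each just scaling its input.
experts = {
    "text":   [lambda x, k=k: (k + 1) * x for k in range(4)],
    "vision": [lambda x, k=k: (k + 1) * x for k in range(4)],
}
# Toy routers with fixed logits (a real router is a learned projection).
router = {
    "text":   lambda x: [0.1, 2.0, 0.1, 1.0],
    "vision": lambda x: [1.0, 0.1, 2.0, 0.1],
}
print(moe_layer(1.0, "text", router, experts))
```

Because only the selected experts run, the per-token compute scales with the activated parameter count (3B here) rather than the total (28B); the modality-specific pools keep text and vision tokens from interfering in the same experts.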
Features
- Pretrained multimodal MoE model with 28B total parameters
- 3B activated parameters per token for efficient inference
- Supports both text and vision inputs with 64 text and 64 vision experts
- Staged training for stable multimodal learning
- Long context window up to 131,072 tokens
- Built with PaddlePaddle; PT weights support PyTorch-based inference
- Includes visual experts and adapters for image processing
- Commercial-use friendly under Apache 2.0 license
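The low-bit quantization mentioned above generally works by storing weights as small integers plus a per-group scale. The sketch below shows group-wise symmetric 4-bit quantization as a general illustration of the idea; the group size and rounding scheme are assumptions, not ERNIE's actual quantization recipe.

```python
# Group-wise symmetric 4-bit quantization sketch (illustrative only).
# Each group of weights shares one float scale; values are stored as
# integers in [-7, 7] and recovered by multiplying back by the scale.

def quantize_4bit(weights, group_size=4):
    """Quantize weights to ints in [-7, 7] with one scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid zero scale
        scales.append(scale)
        q.extend(max(-7, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_4bit(q, scales, group_size=4):
    """Reconstruct approximate float weights from ints and group scales."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]

w = [0.12, -0.53, 0.07, 0.31, 1.4, -0.9, 0.05, -0.2]
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
```

Storing 4-bit integers instead of 16-bit floats shrinks weight memory by roughly 4x, at the cost of a small per-weight reconstruction error bounded by half the group scale.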