ERNIE-4.5-VL-424B-A47B-PT is a large-scale multimodal mixture-of-experts (MoE) model developed by Baidu that integrates advanced language and vision capabilities. With 424 billion total parameters and 47 billion activated per token, it builds on the ERNIE 4.5 MoE foundation and adds strong image-text interaction for complex reasoning and generation tasks. A structured post-training process, combining Supervised Fine-tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR), improves alignment and performance across diverse use cases.

The model supports both thinking and non-thinking inference modes, enabling flexible and interpretable outputs in real-world applications. Its heterogeneous MoE architecture uses modality-isolated routing and a token-balanced loss so that the text and visual components can be trained jointly and efficiently.
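Modality-isolated routing means a token is scored only against the expert pool for its own modality. The following is a minimal toy sketch of that idea; the expert counts, hidden size, and scoring function here are illustrative assumptions, not ERNIE 4.5's actual router.

```python
# Toy sketch of modality-isolated top-k MoE routing (illustrative only;
# expert counts, dimensions, and scoring are assumptions, not the real model).
import numpy as np

rng = np.random.default_rng(0)

N_TEXT_EXPERTS = 4      # hypothetical text-expert pool size
N_VISION_EXPERTS = 4    # hypothetical vision-expert pool size
HIDDEN = 8
TOP_K = 2

# Separate router weights per modality: text tokens are scored only
# against text experts, vision tokens only against vision experts.
W_text = rng.standard_normal((HIDDEN, N_TEXT_EXPERTS))
W_vision = rng.standard_normal((HIDDEN, N_VISION_EXPERTS))

def route(token: np.ndarray, modality: str) -> list:
    """Return top-k expert indices within the token's own modality pool."""
    W = W_text if modality == "text" else W_vision
    scores = token @ W
    return list(np.argsort(scores)[-TOP_K:][::-1])

text_token = rng.standard_normal(HIDDEN)
vision_token = rng.standard_normal(HIDDEN)
print(route(text_token, "text"))      # indices into the text-expert pool
print(route(vision_token, "vision"))  # indices into the vision-expert pool
```

Because the two pools never compete for the same tokens, the token-balanced loss can balance load within each modality independently.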
## Features
- 424B total parameters with 47B activated per token
- Multimodal input: supports both vision and text tasks
- Post-trained with SFT and RLVR for improved alignment
- Switchable thinking mode for flexible reasoning depth
- Built with PaddlePaddle and supports FastDeploy
- Uses modality-isolated routing and a token-balanced loss
- Compatible with vLLM and supports 4-bit/8-bit quantization
- Handles long-context sequences up to 131,072 tokens
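Given the vLLM compatibility and 131,072-token context noted above, a deployment could look like the configuration fragment below. This is a sketch, not tested guidance: the model identifier, tensor-parallel degree, and quantization choice are assumptions you should adjust for your hardware and the model's published serving instructions.

```shell
# Hypothetical vLLM launch (model id, parallelism, and quantization are
# assumptions; a 424B MoE model requires a large multi-GPU node).
vllm serve baidu/ERNIE-4.5-VL-424B-A47B-PT \
    --trust-remote-code \
    --max-model-len 131072 \
    --tensor-parallel-size 8
```

Lowering `--max-model-len` reduces KV-cache memory if you do not need the full context window.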