ERNIE-4.5-VL-424B-A47B-Base-PT is a multimodal Mixture-of-Experts (MoE) model developed by Baidu and post-trained for both text and visual tasks. It builds on the pretraining of ERNIE 4.5, applying modality-specific post-training techniques to optimize for general-purpose natural language processing and vision-language reasoning. The model employs a heterogeneous MoE architecture with modality-isolated routing and loss-balancing mechanisms so that experts activate efficiently and specialize per modality. Of its 424 billion total parameters, 47 billion are active per token, and it supports large context windows and deep cross-modal understanding. Key training strategies include FP8 mixed precision, fine-grained recomputation, and advanced quantization methods for efficient inference. It supports both "thinking" and "non-thinking" visual modes, allowing it to handle tasks ranging from pure text generation to image-aware reasoning.
Features
- Multimodal model supporting both text and image inputs
- 424B parameters with 47B activated per token
- Fine-tuned for cross-modal comprehension and generation
- Heterogeneous MoE architecture with modality-isolated routing
- Trained using FP8 mixed precision and hybrid parallelism
- Supports context lengths up to 131,072 tokens
- Supports supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and Unified Preference Optimization (UPO) post-training techniques
- Apache 2.0 license for commercial and research use
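The modality-isolated routing listed above can be sketched in a few lines: text tokens are gated only over the text expert pool and vision tokens only over the vision expert pool, with top-k selection per token. The expert names, dimensions, gate weights, and top-k value below are illustrative assumptions for the sketch, not ERNIE's actual configuration.

```python
import math
import random

random.seed(0)

DIM = 8  # toy hidden size (illustrative only)
TEXT_EXPERTS = [f"text-{i}" for i in range(4)]
VISION_EXPERTS = [f"vision-{i}" for i in range(4)]

# Random gate vectors, one per expert (stand-ins for learned router weights).
gates = {name: [random.gauss(0, 1) for _ in range(DIM)]
         for name in TEXT_EXPERTS + VISION_EXPERTS}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(hidden, modality, top_k=2):
    """Modality-isolated routing: a text token can only reach text experts,
    a vision token only vision experts. Returns (expert, weight) pairs."""
    pool = TEXT_EXPERTS if modality == "text" else VISION_EXPERTS
    logits = [sum(h * g for h, g in zip(hidden, gates[e])) for e in pool]
    probs = softmax(logits)
    ranked = sorted(zip(pool, probs), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

token = [random.gauss(0, 1) for _ in range(DIM)]
print(route(token, "text"))    # only text-* experts are ever selected
print(route(token, "vision"))  # only vision-* experts are ever selected
```

Because the pools are disjoint, one modality's tokens can never crowd out the other's experts, which is the motivation for isolating the routing; the loss-balancing mechanisms mentioned above additionally keep load spread evenly within each pool.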