ERNIE-4.5-VL-424B-A47B-Base-Paddle is a multimodal Mixture-of-Experts (MoE) model developed by Baidu, designed to understand and reason over both text and images. It uses a heterogeneous MoE architecture with modality-isolated routing and specialized loss functions so that each modality is learned effectively without interfering with the other. Pretrained on trillions of tokens, the model activates 47B of its 424B total parameters per token, balancing capacity against compute cost. Training follows a staged approach: the model is first trained on language, then extended to vision with additional modules such as a ViT encoder and visual experts. It supports ultra-long contexts of up to 131,072 tokens, enabling complex reasoning and long-form narrative generation. Built on the PaddlePaddle framework, it leverages FP8 mixed precision, hybrid parallelism, and quantization techniques for efficient training and inference.
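The modality-isolated routing described above can be illustrated with a small sketch: each token is routed only to experts of its own modality, so text and vision tokens never compete for the same expert capacity. This is a toy NumPy illustration under assumed names (`moe_route`, the `gate`/`w` expert fields), not ERNIE's actual implementation.

```python
import numpy as np

def moe_route(tokens, modality, text_experts, vision_experts, top_k=2):
    """Toy modality-isolated top-k MoE routing (illustrative only)."""
    outputs = []
    for x, m in zip(tokens, modality):
        # Modality isolation: pick the expert pool for this token's modality.
        experts = text_experts if m == "text" else vision_experts
        # Router: score each expert with a learned gating vector.
        scores = np.array([e["gate"] @ x for e in experts])
        top = np.argsort(scores)[-top_k:]  # indices of the top-k experts
        weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax
        # Combine only the selected experts' outputs, weighted by the router.
        y = sum(w * (experts[i]["w"] @ x) for w, i in zip(weights, top))
        outputs.append(y)
    return np.stack(outputs)
```

With `top_k=2` out of, say, 64 experts per modality, only a small fraction of expert parameters is active for any given token, which is the mechanism behind the 47B-of-424B activation figure.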
Features
- 424B total parameters with 47B activated per token
- Trained for both language and visual understanding
- Multimodal heterogeneous MoE architecture
- Supports ultra-long context length (131,072 tokens)
- Includes modality-specific experts and visual adapters
- Trained using FP8 mixed precision and efficient pipeline scheduling
- Optimized for cross-modal reasoning and generation
- Built with PaddlePaddle for wide hardware compatibility