ERNIE-4.5-VL-28B-A3B-Base-Paddle is a multimodal Mixture-of-Experts (MoE) model designed to understand and generate content from both text and images. With 28 billion total parameters and 3 billion activated per token, it strikes a balance between performance and efficiency. Its heterogeneous MoE architecture uses modality-isolated routing and token-balanced losses to avoid cross-modality interference.

The model undergoes staged pretraining: it first focuses on textual understanding, then incorporates visual capabilities through Vision Transformers, adapters, and dedicated visual experts. It supports context lengths up to 131,072 tokens, making it suitable for long-form reasoning and extended image-text interactions.

Built on PaddlePaddle and pretrained on trillions of tokens, the model is optimized for conversational, generative, and reasoning tasks. It supports English and Chinese and is released under the Apache 2.0 license.
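Modality-isolated routing can be illustrated with a small toy sketch: text tokens consult only a text router and text experts, while vision tokens consult only a vision router and vision experts, so neither modality's gradients can disturb the other's routing statistics. This is an illustrative approximation, not the actual ERNIE 4.5 implementation; expert counts, top-k, and dimensions below are placeholder values.

```python
import numpy as np

# Toy sketch of modality-isolated MoE routing (illustrative only; the real
# model uses 64 experts per modality plus shared experts and learned routers).
rng = np.random.default_rng(0)

N_TEXT_EXPERTS = 4
N_VISION_EXPERTS = 4
TOP_K = 2          # experts activated per token (assumed for this sketch)
D_MODEL = 8

def route(tokens, router_weights, top_k=TOP_K):
    """Return (expert indices, gate weights) for the top-k experts per token."""
    logits = tokens @ router_weights                  # (n_tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]     # top-k expert ids
    picked = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(picked) / np.exp(picked).sum(-1, keepdims=True)
    return top, gates

# Each modality has its own router matrix and its own expert pool.
text_router = rng.normal(size=(D_MODEL, N_TEXT_EXPERTS))
vision_router = rng.normal(size=(D_MODEL, N_VISION_EXPERTS))

text_tokens = rng.normal(size=(5, D_MODEL))
vision_tokens = rng.normal(size=(3, D_MODEL))

# Modality-isolated routing: tokens never cross into the other modality's pool.
text_idx, text_gates = route(text_tokens, text_router)
vis_idx, vis_gates = route(vision_tokens, vision_router)

assert text_idx.max() < N_TEXT_EXPERTS   # text lands only on text experts
assert vis_idx.max() < N_VISION_EXPERTS  # vision lands only on vision experts
```

Activating only `TOP_K` experts per token is what keeps the per-token compute (3B activated parameters) far below the total parameter count (28B).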
## Features
- Multimodal support for text and vision tasks
- 28B total parameters with 3B activated per token
- 64 text and 64 vision experts with 2 shared experts
- Staged training with dedicated visual and textual phases
- Long context window up to 131,072 tokens
- Supports English and Chinese
- Built on PaddlePaddle with scalable inference support
- Released under Apache 2.0 for commercial use
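The token-balanced loss mentioned above can be sketched as a standard load-balancing auxiliary term in the style of Switch Transformer; the exact formulation used by ERNIE 4.5 may differ, and all names below are placeholders.

```python
import numpy as np

def load_balance_loss(router_probs, expert_indices, n_experts):
    """Auxiliary loss encouraging tokens to spread evenly across experts.

    router_probs:   (n_tokens, n_experts) softmax router probabilities
    expert_indices: (n_tokens,) top-1 expert assignment per token
    """
    # f_i: fraction of tokens dispatched to expert i
    counts = np.bincount(expert_indices, minlength=n_experts)
    f = counts / len(expert_indices)
    # p_i: mean router probability mass assigned to expert i
    p = router_probs.mean(axis=0)
    # Minimized (value 1.0) when both distributions are uniform.
    return n_experts * float(np.dot(f, p))

rng = np.random.default_rng(0)
n_tokens, n_experts = 1024, 8
logits = rng.normal(size=(n_tokens, n_experts))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
loss = load_balance_loss(probs, probs.argmax(-1), n_experts)
```

Adding this term to the training objective penalizes routers that collapse onto a few experts, which matters doubly in a multimodal MoE where an imbalanced modality could otherwise starve the other's experts.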