ERNIE-4.5-VL-28B-A3B-PT is a multimodal Mixture-of-Experts (MoE) model from Baidu, designed for vision-language reasoning and generation. With 28 billion total parameters, of which 3 billion are activated per token, it supports high-quality image-text interaction for tasks such as visual question answering, image description, and multimodal chain-of-thought reasoning. The model uses a heterogeneous MoE architecture with modality-isolated routing and token-balanced training to optimize cross-modal representation.

Post-training enhancements include Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Unified Preference Optimization (UPO), and Reinforcement Learning with Verifiable Rewards (RLVR). Built on PaddlePaddle and compatible with the Transformers library, the model supports both thinking and non-thinking inference modes, handles long contexts of up to 131,072 tokens, and is designed to scale across a range of hardware.
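As an illustration of a visual Q&A request, a chat-style payload for this kind of model might be structured as below. Note that the field names (`enable_thinking`, the image/text content schema, and the `build_messages` helper) are hypothetical sketches based on common Transformers chat conventions, not this model's confirmed API.

```python
# Sketch of a multimodal chat payload for visual Q&A.
# NOTE: "enable_thinking" and the content schema are illustrative
# assumptions, not the model's confirmed API.

def build_messages(image_url: str, question: str, thinking: bool = True):
    """Pair an image with a text question in a chat-style message list."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }
    ]
    # Hypothetical option toggling thinking vs. non-thinking inference mode.
    options = {"enable_thinking": thinking}
    return messages, options

messages, options = build_messages(
    "https://example.com/cat.jpg", "What animal is this?"
)
```

The same payload would be passed to the model's chat template or a serving endpoint; flipping `thinking=False` selects the non-thinking mode.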
Features
- 28B total parameters with 3B activated per token
- Text and vision modality support with MoE routing
- Enables multimodal reasoning and chain-of-thought
- Includes RLVR, SFT, DPO, and UPO for alignment
- Transformers-compatible for easy deployment
- PaddlePaddle backend for high performance
- Supports thinking and non-thinking inference modes
- Long context handling up to 131,072 tokens
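The sparse-activation figures above mean only a small fraction of the weights participate in any given forward step; a quick back-of-envelope check, using only the parameter counts from this card:

```python
# Back-of-envelope: fraction of parameters active per token in the MoE.
# The 28B total / 3B activated figures come from the model card above.

TOTAL_PARAMS = 28e9
ACTIVE_PARAMS = 3e9

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"{active_fraction:.1%} of parameters active per token")  # roughly 10.7%
```

This ratio is why the model's per-token compute cost is closer to that of a ~3B dense model, even though the full checkpoint holds 28B parameters.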