Janus-Pro-7B is a 7-billion-parameter autoregressive model from DeepSeek AI that unifies multimodal understanding and generation within a single transformer architecture. Its key idea is decoupled visual encoding: separate vision input pathways for understanding and for generation, which adds flexibility and avoids the performance conflicts that arise when one encoder serves both roles. For understanding tasks, it uses the SigLIP-L vision encoder at 384x384 image resolution; for generation, it uses a dedicated image tokenizer with a downsampling rate of 16. Built on the DeepSeek-LLM 7B base, Janus-Pro matches or exceeds task-specific models across a wide range of vision-language tasks, handling text-to-image generation, image captioning, and visual question answering under one unified framework. Janus-Pro is released under the MIT license and supports PyTorch-based multimodal applications.
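The resolution and downsampling figures above determine how many visual tokens each pathway feeds into the transformer. A quick sketch of the arithmetic (the SigLIP-L/16 patch size is an assumption; the 384x384 input and generation downsampling rate of 16 come from the description above):

```python
def vision_tokens(image_side: int, factor: int) -> int:
    """Number of visual tokens for a square image, given a patch
    size or downsampling factor that divides the image into a grid."""
    grid = image_side // factor
    return grid * grid

# Understanding pathway: SigLIP-L on a 384x384 input
# (patch size 16 is an assumption, typical of SigLIP-L/16).
understanding_tokens = vision_tokens(384, 16)

# Generation pathway: image tokenizer with downsampling rate 16.
generation_tokens = vision_tokens(384, 16)

print(understanding_tokens, generation_tokens)  # -> 576 576
```

Under these assumptions both pathways happen to produce the same 24x24 grid of 576 tokens per image, even though the encoders themselves are separate.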
Features
- 7B-parameter unified transformer for multimodal tasks
- Decouples vision encoding for understanding vs. generation
- Supports 384x384 image input via SigLIP-L encoder
- Enables text-to-image and image-to-text generation
- Built on DeepSeek-LLM 7B architecture
- Matches or surpasses task-specific models on vision-language benchmarks
- Licensed under MIT and available for commercial use
- Compatible with PyTorch and Hugging Face ecosystem