January 28, 2025
DeepSeek Releases Groundbreaking Open-Source Multimodal Model
Janus-Pro: Leading the Revolution in Multimodal AI Technology
A major breakthrough in artificial intelligence! On January 28, 2025, DeepSeek made waves by releasing its new open-source multimodal AI model, Janus-Pro, on the Hugging Face platform. The model comes in two sizes, 1B and 7B parameters, with Janus-Pro-7B demonstrating exceptional performance across multiple authoritative benchmarks and surpassing mainstream models including OpenAI's DALL-E 3 and Stable Diffusion.
As an advanced system that integrates both image understanding and image generation, Janus-Pro uses the SigLIP-L vision encoder for image understanding and a LlamaGen-based tokenizer for high-quality image generation. Notably, even OpenAI has not yet made GPT-4o's native image generation capability publicly available.
The Path of Technical Evolution
DeepSeek has demonstrated strong technical innovation in multimodal AI through a steady series of releases:
In October 2024, the team first introduced the base version of Janus, a breakthrough open-source multimodal large language model (MLLM). Its key innovation was decoupling visual encoding for understanding and generation, opening a new path for unified multimodal understanding and generation. This version was built on DeepSeek-LLM-1.3b-base, which was trained on a corpus of approximately 500B text tokens. For image understanding, the model uses SigLIP-L as its vision encoder, supporting 384x384 image inputs; for image generation, it employs a tokenizer with a 16x downsampling rate.
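For illustration, the stated 384x384 input resolution and 16x downsampling imply a 24x24 grid, i.e. 576 visual tokens per image. The short calculation below is a back-of-the-envelope sketch derived from those two figures (treating both the understanding patch size and the generation downsampling rate as 16 is an assumption consistent with the numbers above), not code from the model.

```python
# Rough token-count calculation implied by the stated resolution and downsampling rate.
resolution = 384   # input image resolution (384x384)
downsample = 16    # assumed downsampling factor (patch size / VQ tokenizer rate)

grid_side = resolution // downsample
tokens_per_image = grid_side * grid_side
print(f"{grid_side} x {grid_side} grid -> {tokens_per_image} visual tokens per image")  # 24 x 24 -> 576
```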
In November 2024, DeepSeek released JanusFlow, a unified multimodal understanding and generation framework. Its distinctive trait is a minimalist design that combines an autoregressive language model with rectified flow, a state-of-the-art generative modeling method. The research showed that rectified flow can be trained directly within the large language model framework, without requiring complex architectural modifications.
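To give a sense of what training with rectified flow involves, here is a minimal, self-contained PyTorch sketch of one rectified-flow training step; the toy `velocity_net` and tensor shapes are illustrative stand-ins, not JanusFlow's actual architecture.

```python
# Minimal rectified-flow training step (illustrative sketch; not JanusFlow's actual code).
import torch
import torch.nn as nn

velocity_net = nn.Sequential(          # toy stand-in for the network that predicts velocity
    nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16)
)

x1 = torch.randn(32, 16)               # "data" samples (e.g., latent image features)
x0 = torch.randn_like(x1)              # noise samples
t = torch.rand(32, 1)                  # random interpolation times in [0, 1]
xt = (1 - t) * x0 + t * x1             # straight-line interpolation between noise and data
target_v = x1 - x0                     # rectified-flow target: constant velocity along that line
pred_v = velocity_net(torch.cat([xt, t], dim=-1))
loss = ((pred_v - target_v) ** 2).mean()   # regress predicted velocity onto the target
loss.backward()
```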
Janus-Pro: Technical Breakthroughs and Innovation
At the beginning of 2025, the Janus series welcomed a major upgrade with the official release of Janus-Pro. This version inherits the advantages of the original architecture while introducing multiple innovations:
Core Technical Features:
- Innovative autoregressive framework design
- Visual encoding path decoupling technology
- Unified transformer architecture
- Flexible multimodal processing capabilities
Technical Implementation Details:
- Decoupled visual encoding as a core design principle
- Independent encoding paths for processing raw inputs
- A unified autoregressive Transformer architecture
- An innovative feature processing flow (sketched below):
  - A SigLIP encoder extracts semantic features from the image
  - 2D feature maps are flattened into a 1D sequence
  - An adaptor maps the features into the language model's input space
  - VQ tokens are used for image generation
  - Multimodal feature sequences are fused for joint processing
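The sketch below makes this flow concrete in PyTorch; all dimensions, vocabulary sizes, and module names (`und_adaptor`, `gen_embed`, and so on) are hypothetical placeholders rather than Janus-Pro's real configuration.

```python
# Illustrative sketch of the decoupled feature flow described above (hypothetical names/dimensions).
import torch
import torch.nn as nn

hidden_dim, vision_dim, vq_vocab = 2048, 1024, 16384

und_adaptor = nn.Linear(vision_dim, hidden_dim)     # maps SigLIP-style features into LLM space
gen_embed = nn.Embedding(vq_vocab, hidden_dim)      # embedding table for discrete VQ image tokens
text_embed = nn.Embedding(102400, hidden_dim)       # ordinary text-token embedding

# Understanding path: a 2D patch grid (24x24) of semantic features is flattened to a 1D sequence.
patch_grid = torch.randn(1, 24, 24, vision_dim)
und_seq = und_adaptor(patch_grid.flatten(1, 2))     # (1, 576, hidden_dim)

# Generation path: discrete VQ token ids for the image being produced.
vq_ids = torch.randint(0, vq_vocab, (1, 576))
gen_seq = gen_embed(vq_ids)                         # (1, 576, hidden_dim)

# Fusion: text and image features are concatenated into one sequence for the
# unified autoregressive Transformer backbone.
text_seq = text_embed(torch.randint(0, 102400, (1, 8)))
fused = torch.cat([text_seq, und_seq, gen_seq], dim=1)
print(fused.shape)                                  # torch.Size([1, 1160, 2048])
```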
Performance and Evaluation
Janus-Pro was trained using the PyTorch-based HAI-LLM distributed training framework, with the following configurations:
- 1.5B version: 16-node cluster, 7-day training cycle
- 7B version: 32-node cluster, 14-day training cycle
- Hardware configuration: 8 A100 (40GB) GPUs per node
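For a rough sense of scale, the reported cluster sizes and durations translate into the back-of-the-envelope GPU-day totals below (an estimate derived from the figures above, not an official number).

```python
# Rough GPU-day estimate from the reported cluster sizes and training times.
gpus_per_node = 8

for name, nodes, days in [("Janus-Pro-1.5B", 16, 7), ("Janus-Pro-7B", 32, 14)]:
    gpus = nodes * gpus_per_node
    print(f"{name}: {gpus} A100-40GB GPUs x {days} days = {gpus * days} GPU-days")
# Janus-Pro-1.5B: 128 A100-40GB GPUs x 7 days = 896 GPU-days
# Janus-Pro-7B: 256 A100-40GB GPUs x 14 days = 3584 GPU-days
```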
In authoritative benchmark tests, Janus-Pro-7B demonstrated excellent performance:
Multimodal Understanding Capability:
- MMBench score: 79.2, significantly surpassing:
  - Janus (69.4)
  - TokenFlow-XL (68.9)
  - MetaMorph (75.2)
Image Generation and Instruction Following Capability:
- GenEval overall accuracy: 0.80 (80%), superior to:
  - Janus (0.61)
  - Transfusion (0.63)
  - DALL-E 3 (0.67)
  - Stable Diffusion 3 Medium (0.74)
- DPG-Bench score: 84.19, leading all compared methods
Open Source and Future Prospects
Currently, the complete code for Janus-Pro is available on GitHub: https://github.com/deepseek-ai/Janus
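For readers who want to try the released weights, a minimal download sketch using the standard `huggingface_hub` client is shown below; the repository id `deepseek-ai/Janus-Pro-7B` is assumed here, and the GitHub README remains the authoritative reference for inference code.

```python
# Fetch the Janus-Pro-7B checkpoint files from Hugging Face (repo id assumed; see the README for official usage).
from huggingface_hub import snapshot_download

local_dir = snapshot_download("deepseek-ai/Janus-Pro-7B")
print("Model files downloaded to:", local_dir)
```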
Industry observers widely expect that DeepSeek's breakthrough progress in the multimodal field, particularly its innovations in the text-to-image and text-to-video directions, will draw strong attention from tech giants including OpenAI and Meta.
Special Note: Due to recent large-scale cyberattacks, DeepSeek has temporarily restricted registration for phone numbers outside mainland China, so overseas users currently need to register with virtual numbers. Existing registered users are not affected and can continue to use the platform's services normally.