January 28, 2025
DeepSeek Releases Groundbreaking Open-Source Multimodal Model
Janus-Pro: Leading the Revolution in Multimodal AI Technology
A major breakthrough in artificial intelligence! On January 28, 2025, DeepSeek made waves by releasing its new open-source multimodal AI model, Janus-Pro, on the Hugging Face platform. The model comes in two sizes, 1B and 7B parameters, with Janus-Pro-7B demonstrating exceptional performance across multiple authoritative benchmarks and surpassing mainstream models including OpenAI's DALL-E 3 and Stable Diffusion.
As an advanced system that integrates both image understanding and image generation, Janus-Pro uses the SigLIP-L vision encoder for image understanding and a LlamaGen-based tokenizer for high-quality image generation. Notably, even OpenAI has not yet made GPT-4o's native image generation capability publicly available.
The Path of Technical Evolution
DeepSeek has demonstrated strong technical innovation in multimodal AI through a steady series of releases:
In October 2024, the team first introduced the base version of Janus, a breakthrough open-source multimodal large language model (MLLM). Its key innovation was decoupling visual encoding for understanding and generation, opening a new path for unified multimodal understanding and generation. This version was built on DeepSeek-LLM-1.3b-base, which was trained on a corpus of approximately 500B text tokens. For image understanding, the model uses SigLIP-L as its vision encoder, supporting 384x384 image inputs; for image generation, it employs a tokenizer with a 16x downsampling rate.
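For illustration, the stated 384x384 input resolution and 16x downsampling imply a 24x24 grid, i.e. 576 visual tokens per image. The short calculation below is a back-of-the-envelope sketch derived from those two figures (treating both the understanding patch size and the generation downsampling rate as 16 is an assumption consistent with the numbers above), not code from the model.

```python
# Rough token-count calculation implied by the stated resolution and downsampling rate.
resolution = 384   # input image resolution (384x384)
downsample = 16    # assumed downsampling factor (patch size / VQ tokenizer rate)

grid_side = resolution // downsample
tokens_per_image = grid_side * grid_side
print(f"{grid_side} x {grid_side} grid -> {tokens_per_image} visual tokens per image")  # 24 x 24 -> 576
```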
In November 2024, DeepSeek released JanusFlow, a unified multimodal understanding and generation framework. Its distinctive trait is a minimalist design that combines an autoregressive language model with rectified flow, a state-of-the-art generative modeling method. The research showed that rectified flow can be trained directly within the large language model framework, without requiring complex architectural modifications.
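To give a sense of what training with rectified flow involves, here is a minimal, self-contained PyTorch sketch of one rectified-flow training step; the toy `velocity_net` and tensor shapes are illustrative stand-ins, not JanusFlow's actual architecture.

```python
# Minimal rectified-flow training step (illustrative sketch; not JanusFlow's actual code).
import torch
import torch.nn as nn

velocity_net = nn.Sequential(          # toy stand-in for the network that predicts velocity
    nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16)
)

x1 = torch.randn(32, 16)               # "data" samples (e.g., latent image features)
x0 = torch.randn_like(x1)              # noise samples
t = torch.rand(32, 1)                  # random interpolation times in [0, 1]
xt = (1 - t) * x0 + t * x1             # straight-line interpolation between noise and data
target_v = x1 - x0                     # rectified-flow target: constant velocity along that line
pred_v = velocity_net(torch.cat([xt, t], dim=-1))
loss = ((pred_v - target_v) ** 2).mean()   # regress predicted velocity onto the target
loss.backward()
```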
Janus-Pro: Technical Breakthroughs and Innovation
At the beginning of 2025, the Janus series welcomed a major upgrade with the official release of Janus-Pro. This version inherits the advantages of the original architecture while introducing multiple innovations:
Core Technical Features:
- Innovative autoregressive framework design
- Visual encoding path decoupling technology
- Unified transformer architecture
- Flexible multimodal processing capabilities
Technical Implementation Details:
- Decoupled visual encoding as a core design principle
- Independent encoding paths for processing raw inputs
- A unified autoregressive Transformer architecture
- An innovative feature processing flow (sketched below):
  - A SigLIP encoder extracts semantic features from the image
  - 2D feature maps are flattened into a 1D sequence
  - An adaptor maps the features into the language model's input space
  - VQ tokens are used for image generation
  - Multimodal feature sequences are fused for joint processing
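The sketch below makes this flow concrete in PyTorch; all dimensions, vocabulary sizes, and module names (`und_adaptor`, `gen_embed`, and so on) are hypothetical placeholders rather than Janus-Pro's real configuration.

```python
# Illustrative sketch of the decoupled feature flow described above (hypothetical names/dimensions).
import torch
import torch.nn as nn

hidden_dim, vision_dim, vq_vocab = 2048, 1024, 16384

und_adaptor = nn.Linear(vision_dim, hidden_dim)     # maps SigLIP-style features into LLM space
gen_embed = nn.Embedding(vq_vocab, hidden_dim)      # embedding table for discrete VQ image tokens
text_embed = nn.Embedding(102400, hidden_dim)       # ordinary text-token embedding

# Understanding path: a 2D patch grid (24x24) of semantic features is flattened to a 1D sequence.
patch_grid = torch.randn(1, 24, 24, vision_dim)
und_seq = und_adaptor(patch_grid.flatten(1, 2))     # (1, 576, hidden_dim)

# Generation path: discrete VQ token ids for the image being produced.
vq_ids = torch.randint(0, vq_vocab, (1, 576))
gen_seq = gen_embed(vq_ids)                         # (1, 576, hidden_dim)

# Fusion: text and image features are concatenated into one sequence for the
# unified autoregressive Transformer backbone.
text_seq = text_embed(torch.randint(0, 102400, (1, 8)))
fused = torch.cat([text_seq, und_seq, gen_seq], dim=1)
print(fused.shape)                                  # torch.Size([1, 1160, 2048])
```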
Performance and Evaluation
Janus-Pro was trained using the PyTorch-based HAI-LLM distributed training framework, with the following configurations:
- 1.5B version: 16-node cluster, 7-day training cycle
- 7B version: 32-node cluster, 14-day training cycle
- Hardware configuration: 8 A100 (40GB) GPUs per node
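For a rough sense of scale, the reported cluster sizes and durations translate into the back-of-the-envelope GPU-day totals below (an estimate derived from the figures above, not an official number).

```python
# Rough GPU-day estimate from the reported cluster sizes and training times.
gpus_per_node = 8

for name, nodes, days in [("Janus-Pro-1.5B", 16, 7), ("Janus-Pro-7B", 32, 14)]:
    gpus = nodes * gpus_per_node
    print(f"{name}: {gpus} A100-40GB GPUs x {days} days = {gpus * days} GPU-days")
# Janus-Pro-1.5B: 128 A100-40GB GPUs x 7 days = 896 GPU-days
# Janus-Pro-7B: 256 A100-40GB GPUs x 14 days = 3584 GPU-days
```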
In authoritative benchmark tests, Janus-Pro-7B demonstrated excellent performance:
Multimodal Understanding Capability:
- MMBench score: 79.2, significantly surpassing:
  - Janus (69.4)
  - TokenFlow-XL (68.9)
  - MetaMorph (75.2)
Image Generation and Instruction Following Capability:
- GenEval overall accuracy: 0.80 (80%), superior to:
  - Janus (0.61)
  - Transfusion (0.63)
  - DALL-E 3 (0.67)
  - Stable Diffusion 3 Medium (0.74)
- DPG-Bench score: 84.19, leading all compared methods
Open Source and Future Prospects
Currently, the complete code for Janus-Pro is available on GitHub: https://github.com/deepseek-ai/Janus
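For readers who want to try the released weights, a minimal download sketch using the standard `huggingface_hub` client is shown below; the repository id `deepseek-ai/Janus-Pro-7B` is assumed here, and the GitHub README remains the authoritative reference for inference code.

```python
# Fetch the Janus-Pro-7B checkpoint files from Hugging Face (repo id assumed; see the README for official usage).
from huggingface_hub import snapshot_download

local_dir = snapshot_download("deepseek-ai/Janus-Pro-7B")
print("Model files downloaded to:", local_dir)
```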
Industry observers widely expect that DeepSeek's breakthrough progress in the multimodal field, particularly its innovations in the text-to-image and text-to-video directions, will draw strong attention from tech giants including OpenAI and Meta.
Special Note: Due to recent large-scale cyberattacks, DeepSeek has temporarily restricted registration for phone numbers outside mainland China, so overseas users currently need to register with virtual numbers. Existing registered users are not affected and can continue to use the platform's services normally.