The first model in the Qwen3-Next series, featuring an innovative hybrid attention architecture and a high-efficiency Mixture-of-Experts (MoE) design.
Key Features
- Hybrid Attention: Combines Gated DeltaNet and Gated Attention for efficient ultra-long context modeling
- High-Sparsity MoE: 80B total parameters with only 3B activated, providing excellent efficiency
- Ultra-Long Context: Supports up to 262,144 tokens natively
- Multi-Token Prediction: Enhanced pretraining performance and faster inference
- Advanced Capabilities: Excels at reasoning, coding, creative writing, and agentic tasks
- Multilingual Support: Over 100 languages and dialects
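For orientation, below is a minimal generation sketch using Hugging Face transformers. The checkpoint name `Qwen/Qwen3-Next-80B-A3B-Instruct` and the loading options are illustrative assumptions, not specified in this document; adapt them to the checkpoint and transformers version you actually use.

```python
# Minimal sketch, assuming the instruct checkpoint is published as
# "Qwen/Qwen3-Next-80B-A3B-Instruct" (an assumption; check the model hub)
# and that your installed transformers version supports this architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the dtype stored in the checkpoint
    device_map="auto",   # shard the 80B weights across available GPUs
)

messages = [{"role": "user", "content": "Summarize the Qwen3-Next architecture."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```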
Architecture Highlights
- 80B total parameters, 3B activated (A3B)
- 48 layers with hybrid layout
- 512 experts with only 10 activated per token (see the back-of-envelope sketch after this list)
- Context length: 262,144 tokens
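A back-of-envelope check (rough arithmetic of my own based on the dimensions in the Architecture section below, not an official parameter accounting) shows how top-10-of-512 routing yields roughly 3B activated parameters:

```python
# Rough, unofficial arithmetic: estimate MoE parameters from the listed
# dimensions (hidden 2048, expert FFN dim 512, 512 experts, 48 layers).
hidden, expert_ffn, n_experts, n_layers = 2048, 512, 512, 48
top_k, shared = 10, 1

# Assumes a SwiGLU-style expert with gate, up, and down projections.
params_per_expert = 3 * hidden * expert_ffn           # ~3.1M

total_moe = n_layers * n_experts * params_per_expert          # ~77B of the 80B total
active_moe = n_layers * (top_k + shared) * params_per_expert  # ~1.7B

print(f"total MoE params  ~ {total_moe / 1e9:.1f}B")
print(f"active MoE params ~ {active_moe / 1e9:.1f}B per token")
# Attention, embeddings, and norms add the rest, landing near 3B activated.
```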
Delivers performance comparable to much larger models while maintaining exceptional efficiency:
- Outperforms Qwen3-32B while delivering roughly 10x its inference throughput on long contexts
- Matches Qwen3-235B-A22B on many benchmarks with significantly lower computational requirements
- Superior ultra-long-context handling at the native 256K (262,144) token window and beyond
Architecture
Details:
- Type: Causal Language Models
- Training Stage: Pretraining (15T tokens) & Post-training
- Number of Parameters: 80B in total and 3B activated
- Number of Parameters (Non-Embedding): 79B
- Hidden Dimension: 2048
- Number of Layers: 48
- Hybrid Layout: 12 * (3 * (Gated DeltaNet → MoE) → 1 * (Gated Attention → MoE)), i.e. 12 repeated four-layer blocks (see the first sketch after this list)
- Gated Attention:
  - Number of Attention Heads: 16 for Q and 2 for KV
  - Head Dimension: 256
  - Rotary Position Embedding Dimension: 64
- Gated DeltaNet:
  - Number of Linear Attention Heads: 32 for V and 16 for QK
  - Head Dimension: 128
- Mixture of Experts:
  - Number of Experts: 512
  - Number of Activated Experts: 10
  - Number of Shared Experts: 1
  - Expert Intermediate Dimension: 512
- Context Length: 262,144 tokens natively, extensible up to 1,010,000 tokens (see the YaRN sketch below)
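Reading the hybrid layout formula above: each of the 12 blocks stacks three Gated DeltaNet layers followed by one Gated Attention layer, with each token mixer followed by an MoE FFN, for 48 layers total. The sketch below is my own illustration of that pattern, not code from the Qwen repository:

```python
# Illustration only: expand the layout formula
# 12 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
# into the per-layer token-mixer sequence for all 48 layers.
def hybrid_layout(n_blocks: int = 12) -> list[str]:
    layers = []
    for _ in range(n_blocks):
        layers += ["Gated DeltaNet -> MoE"] * 3  # linear-attention layers
        layers += ["Gated Attention -> MoE"]     # full-attention layer
    return layers

layout = hybrid_layout()
assert len(layout) == 48
# Every 4th layer (indices 3, 7, 11, ...) uses standard Gated Attention;
# the other 36 layers use the linear-complexity Gated DeltaNet mixer.
print(layout[:8])
```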
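For extending beyond the native window, Qwen models generally document YaRN-based RoPE scaling. The snippet below is a hedged sketch: the scaling factor of 4.0 and the load-time config override are illustrative assumptions, not values confirmed by this document.

```python
# Hedged sketch: enable YaRN RoPE scaling at load time to stretch the
# native 262,144-token window toward ~1M tokens. The factor 4.0 is an
# illustrative assumption (262,144 * 4 ~ 1.05M), not an official value.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed repo name

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}

model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype="auto", device_map="auto"
)
```

Static RoPE scaling of this kind applies at every sequence length, so it is usually advisable to enable it only when prompts actually exceed the native 262,144-token window.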