Model

Qwen3-VL-30B

Public

The latest generation vision-language MoE model in the Qwen series with comprehensive upgrades to visual perception, spatial reasoning, and video understanding.

Use cases

Vision Input

Minimum system memory

18GB

Tags

30B
qwen3_vl_moe

README

Qwen3-VL-30B

The latest generation vision-language MoE model in the Qwen series with comprehensive upgrades to visual perception, spatial reasoning, and video understanding.

Key Features

  • Visual Agent: Operates PC and mobile GUIs—recognizes elements, understands functions, and completes tasks
  • Visual Coding: Generates Draw.io, HTML, CSS, and JavaScript from images and videos
  • Advanced Spatial Perception: Provides 2D/3D grounding for spatial reasoning and embodied AI applications
  • Upgraded Recognition: Recognizes celebrities, anime, products, landmarks, flora, fauna, and more
  • Expanded OCR: Supports 32 languages with robust performance in low light, blur, and tilt conditions
  • Pure Text Performance: Text understanding on par with pure LLMs through seamless text-vision fusion
  • High-Efficiency MoE: 31.1B total parameters with only 3B activated (A3B) for excellent efficiency

Architecture Highlights

  • 31.1B total parameters (3B activated per token)
  • Mixture-of-Experts architecture
  • Interleaved-MRoPE for enhanced video reasoning
  • DeepStack for fine-grained detail capture
  • Text-Timestamp Alignment for precise event localization
  • Context length: 256,000 tokens
  • Vision-enabled multimodal MoE model
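The efficiency claim above follows directly from the parameter figures: only a small fraction of the weights participate in any single forward pass. A quick back-of-envelope check (using the numbers quoted in this card):

```python
# Activation ratio implied by the figures above:
# 31.1B total parameters, ~3B activated per token (A3B).
total_params = 31.1e9
active_params = 3.0e9

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%} of total parameters")
# → roughly 9.6%, i.e. per-token compute closer to a ~3B dense model
```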

Performance

Delivers superior vision-language performance across diverse tasks including document analysis, visual question answering, video understanding, and agentic interactions. The MoE architecture provides excellent efficiency while maintaining high-quality outputs.
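For tasks like document analysis and visual question answering, vision input is typically supplied alongside text in a single chat turn. A minimal sketch of such a request payload, assuming an OpenAI-compatible serving endpoint (the model identifier and image URL are placeholders, not part of this card):

```python
import json

# Hypothetical multimodal request for an OpenAI-compatible chat endpoint.
# "qwen3-vl-30b" and the image URL are illustrative placeholders.
payload = {
    "model": "qwen3-vl-30b",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this document."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/invoice.png"}},
            ],
        }
    ],
}
print(json.dumps(payload, indent=2))
```

The exact field names depend on the serving stack; many local runtimes accept this content-part format for mixing text and images.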

Parameters

Custom configuration options included with this model

  • Repeat Penalty: Disabled
  • Temperature: 0.7
  • Top K Sampling: 20
  • Top P Sampling: 0.8
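To make these defaults concrete, here is a minimal sketch of how temperature, top-k, and top-p interact during token selection. This is an illustrative reimplementation, not the model runtime's actual sampler; the function name and logit values are invented for the example:

```python
import math
import random

def sample_next(logits, temperature=0.7, top_k=20, top_p=0.8, rng=None):
    """Illustrative temperature + top-k + top-p (nucleus) sampling,
    using the default values listed above."""
    rng = rng or random.Random(0)
    # Temperature: scale logits before softmax (lower = sharper distribution).
    scaled = [l / temperature for l in logits]
    # Top-k: keep only the k highest-scoring candidate tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i],
                   reverse=True)[:top_k]
    # Softmax over the surviving candidates (max-subtracted for stability).
    m = max(scaled[i] for i in order)
    probs = [(i, math.exp(scaled[i] - m)) for i in order]
    z = sum(p for _, p in probs)
    probs = [(i, p / z) for i, p in probs]
    # Top-p: truncate to the smallest prefix whose mass reaches p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Draw from the renormalized truncated distribution.
    z = sum(p for _, p in kept)
    r, acc = rng.random() * z, 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]
```

With a very low temperature the distribution collapses onto the highest logit, which is why lowering temperature makes outputs more deterministic; the repeat penalty being disabled simply means logits are not adjusted for previously generated tokens before this step.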