Qwen3 VL 32B
The latest generation vision-language model in the Qwen series with comprehensive upgrades to visual perception and spatial reasoning.
Key Features
- Visual Agent: Operates PC and mobile GUIs—recognizes elements, understands functions, and completes tasks
- Visual Coding: Generates Draw.io, HTML, CSS, and JavaScript from images and videos
- Advanced Spatial Perception: Provides 2D/3D grounding for spatial reasoning and embodied AI applications
- Upgraded Recognition: Recognizes celebrities, anime, products, landmarks, flora, fauna, and more
- Expanded OCR: Supports 32 languages with robust performance in low light, blur, and tilt conditions
- Pure Text Performance: Text understanding on par with pure LLMs through seamless text-vision fusion
Architecture Highlights
- 33B parameters
- Interleaved-MRoPE for enhanced video reasoning
- DeepStack for fine-grained detail capture
- Text-Timestamp Alignment for precise event localization
- Context length: 256,000 tokens
- Vision-enabled multimodal model
Delivers superior vision-language performance across diverse tasks including document analysis, visual question answering, and agentic interactions.