The latest-generation vision-language model in the Qwen series, with comprehensive upgrades to visual perception and spatial reasoning.
Key Features
Visual Agent: Operates PC and mobile GUIs by recognizing interface elements, understanding their functions, and completing tasks
Visual Coding: Generates Draw.io diagrams, HTML, CSS, and JavaScript from images and videos
Advanced Spatial Perception: Provides 2D/3D grounding for spatial reasoning and embodied AI applications
Enhanced Reasoning: Excels at STEM and math with causal analysis and evidence-based answers
Upgraded Recognition: Recognizes celebrities, anime, products, landmarks, flora, fauna, and more
Expanded OCR: Supports 32 languages with robust performance in low light, blur, and tilt conditions
Pure Text Performance: Text understanding on par with pure LLMs through seamless text-vision fusion
Architecture Highlights
2B parameters
DeepStack for fine-grained detail capture
Text-Timestamp Alignment for precise event localization
Context length: 256,000 tokens
Vision-enabled multimodal model
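The 2B parameter count is what makes edge deployment practical. A rough, weight-only back-of-envelope sketch (assuming exactly 2e9 parameters and ignoring activations and the KV cache, which add overhead in practice):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 2e9  # "2B parameters" from the card above; the exact count will differ

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
# 16-bit: ~4.0 GB, 8-bit: ~2.0 GB, 4-bit: ~1.0 GB
```

At 4-bit quantization the weights fit in roughly 1 GB, which is why a model of this size runs comfortably on consumer Apple Silicon.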
Performance
Delivers strong vision-language performance across diverse tasks including document analysis, visual question answering, and agentic interactions. Optimized for edge deployment with efficient inference on Apple Silicon via MLX quantization.
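For MLX inference on Apple Silicon, the mlx-vlm package provides a command-line entry point. A minimal sketch, assuming mlx-vlm is installed and using a placeholder model path; flag names can vary between mlx-vlm versions, so check `--help` for your installed release:

```shell
pip install mlx-vlm

# Replace <model-path> with the local or Hugging Face path of this model
python -m mlx_vlm.generate \
  --model <model-path> \
  --prompt "Describe this image." \
  --image photo.jpg \
  --max-tokens 256
```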
Parameters
Custom configuration options are included with this model.