README
Qwen3 VL 8B is the latest-generation vision-language model in the Qwen series, with comprehensive upgrades to visual perception, spatial reasoning, and video understanding.
Key Features

- Visual Agent: operates PC and mobile GUIs; recognizes interface elements, understands their functions, and completes tasks
- Visual Coding: generates Draw.io diagrams, HTML, CSS, and JavaScript from images and videos
- Advanced Spatial Perception: provides 2D/3D grounding for spatial reasoning and embodied-AI applications
- Upgraded Recognition: recognizes celebrities, anime, products, landmarks, flora, fauna, and more
- Expanded OCR: supports 32 languages, with robust performance under low light, blur, and tilt
- Pure Text Performance: text understanding on par with pure LLMs through seamless text-vision fusion

Architecture Highlights

- 8.77B parameters
- Interleaved-MRoPE for enhanced video reasoning
- DeepStack for fine-grained detail capture
- Text-Timestamp Alignment for precise event localization
- Context length: 256,000 tokens
- Vision-enabled multimodal model

Performance

Delivers strong vision-language performance across diverse tasks, including document analysis, visual question answering, video understanding, and agentic interactions.
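To run a GGUF build of this model locally with Ollama, a minimal Modelfile might look like the sketch below. The file name is illustrative, and the `num_ctx` value simply mirrors the 256,000-token context length listed above; lower it to fit available memory.

```
# Hypothetical Modelfile; the .gguf path is a placeholder.
FROM ./qwen3-vl-8b.gguf

# Request the full advertised context window (reduce if RAM-constrained).
PARAMETER num_ctx 256000
```

Building and running would then follow the usual `ollama create <name> -f Modelfile` and `ollama run <name>` flow.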
Based on: GGUF
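The GGUF container format mentioned above has a small fixed header that can be inspected directly. The sketch below assumes the GGUF v3 layout (little-endian: 4-byte magic `GGUF`, uint32 version, uint64 tensor count, uint64 metadata key-value count) and parses a synthetic in-memory header rather than a real model file.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    # GGUF files start with the 4-byte magic b"GGUF", then a little-endian
    # uint32 version, uint64 tensor count, and uint64 metadata KV count.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}

# Synthetic header for demonstration only (not a real model file).
sample = struct.pack("<4sIQQ", b"GGUF", 3, 339, 24)
print(read_gguf_header(sample))
```

The same `read_gguf_header` call works on the first 24 bytes of an actual `.gguf` file opened in binary mode.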