Qwen3 VL 4B

Qwen3 VL 4B is the latest-generation vision-language model in the Qwen series, with comprehensive upgrades to visual perception, spatial reasoning, and video understanding.

Key Features

Visual Agent: Operates PC and mobile GUIs by recognizing elements, understanding their functions, and completing tasks
Visual Coding: Generates Draw.io, HTML, CSS, and JavaScript from images and videos
Advanced Spatial Perception: Provides 2D/3D grounding for spatial reasoning and embodied AI applications
Upgraded Recognition: Recognizes celebrities, anime, products, landmarks, flora, fauna, and more
Expanded OCR: Supports 32 languages and stays robust on low-light, blurred, and tilted images
Pure Text Performance: Text understanding on par with pure LLMs through seamless text-vision fusion

Architecture Highlights

4.44B parameters
Interleaved-MRoPE for enhanced video reasoning
DeepStack for fine-grained detail capture
Text-Timestamp Alignment for precise event localization
Context length: 256,000 tokens
Vision-enabled multimodal model

Performance

Delivers strong vision-language performance across diverse tasks including document analysis, visual question answering, video understanding, and agentic interactions.

Parameters

Custom configuration options included with this model

Repeat Penalty: Disabled
Temperature: 0.7
Top K Sampling: 20
Top P Sampling: 0.8
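
To make the sampling settings above concrete, here is a minimal sketch of how temperature, top-k, and top-p filtering can be combined when picking the next token. It is an illustration only, not the runtime's actual sampler: the function names `filter_distribution` and `sample` are hypothetical, the order of the steps is an assumption (runtimes differ, and some renormalize after top-k before applying top-p), and no repeat penalty step appears because that option is disabled.

```python
import math
import random


def filter_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k, and top-p.

    Hypothetical helper for illustration; defaults mirror the listed settings.
    """
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax over the scaled logits.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable token indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng=random, **settings):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **settings)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
    print(filter_distribution(toy_logits))
    print(sample(toy_logits, rng=random.Random(0)))
```

With these defaults, low-probability tokens are cut twice: first by rank (at most 20 candidates), then by cumulative mass (80%), so the draw is made only from the head of the distribution.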