Model

qwen3-vl-32b

Public

Use cases

Vision Input

Minimum system memory

20GB

Tags

32B
qwen3_vl

README

Qwen3 VL 32B

The latest generation vision-language model in the Qwen series with comprehensive upgrades to visual perception and spatial reasoning.

Key Features

  • Visual Agent: Operates PC and mobile GUIs—recognizes elements, understands functions, and completes tasks
  • Visual Coding: Generates Draw.io, HTML, CSS, and JavaScript from images and videos
  • Advanced Spatial Perception: Provides 2D/3D grounding for spatial reasoning and embodied AI applications
  • Enhanced Reasoning: Excels at STEM and math with causal analysis and evidence-based answers
  • Upgraded Recognition: Recognizes celebrities, anime, products, landmarks, flora, fauna, and more
  • Expanded OCR: Supports 32 languages with robust performance in low light, blur, and tilt conditions
  • Pure Text Performance: Text understanding on par with pure LLMs through seamless text-vision fusion

Architecture Highlights

  • 33B parameters
  • Interleaved-MRoPE for enhanced video reasoning
  • DeepStack for fine-grained detail capture
  • Text-Timestamp Alignment for precise event localization
  • Context length: 256,000 tokens
  • Vision-enabled multimodal model

Performance

Delivers superior vision-language performance across diverse tasks including document analysis, visual question answering, and agentic interactions. Suitable for deployment on Apple Silicon via MLX quantization.

Parameters

Custom configuration options included with this model

Repeat Penalty
Disabled
Temperature
0.7
Top K Sampling
20
Top P Sampling
0.8