Model

qwen2.5-vl-32b

Public

Use cases

Vision Input

Minimum system memory

19GB

Tags

32B
qwen2vl

README

Qwen2.5-VL-32B-Instruct

Qwen2.5-VL-32B-Instruct is a vision-language model for understanding images, text, and video. It supports structured outputs, visual localization, and temporal reasoning, making it suitable for tasks such as object recognition, chart analysis, and extracting structured data from visual content.

The model is designed for applications in document analysis, event detection, and agentic tool use. Outputs include bounding boxes, points, and JSON-formatted structured data.
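Since the model can emit bounding boxes and structured data as JSON, a downstream consumer typically parses that output into usable objects. A minimal sketch of such a parser; the response schema, field names, and sample values here are hypothetical illustrations, not part of this model card:

```python
import json

# Hypothetical JSON response the model might return when asked to
# localize objects in an image; the schema is an assumption for
# illustration, not specified by the model card.
sample_response = """
{
  "objects": [
    {"label": "cat", "bbox": [34, 120, 210, 340]},
    {"label": "chart_title", "bbox": [12, 8, 480, 40]}
  ]
}
"""

def parse_detections(raw: str) -> list[tuple[str, list[int]]]:
    """Parse structured model output into (label, [x1, y1, x2, y2]) pairs."""
    data = json.loads(raw)
    return [(obj["label"], obj["bbox"]) for obj in data["objects"]]

detections = parse_detections(sample_response)
print(detections)  # [('cat', [34, 120, 210, 340]), ('chart_title', [12, 8, 480, 40])]
```

In practice the exact schema would be fixed by the prompt or a JSON-schema constraint supplied at request time, so the parser can rely on the field names being present.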

Sources

The underlying model files that this model uses.