Model

qwen2.5-vl-72b

Public

Use cases

Vision Input

Minimum system memory

47GB

Tags

72B
qwen2vl

README

Qwen2.5-VL-72B-Instruct

Qwen2.5-VL-72B-Instruct is a vision-language model that processes images, text, and video, supporting structured outputs and visual localization. It is capable of temporal reasoning and can extract structured data from visual content, including charts and layouts.

Intended uses include document analysis, event detection in video, and agentic tool use. Outputs can include bounding boxes, points, and JSON-formatted data.

Sources

The underlying model files this model uses