README
Qwen2.5-VL-72B-Instruct is a vision-language model that accepts images, text, and video as input and supports structured outputs and visual localization. It can reason about temporal order in video and extract structured data from visual content such as charts and document layouts.
Intended uses include document analysis, event detection in video, and agentic tool use. Outputs can include bounding boxes, points, and JSON-formatted data.
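Because outputs can carry bounding boxes as JSON, downstream code usually needs to parse and validate the reply before using it. The following is a minimal sketch of that step, not the model's official tooling. It assumes the reply is a JSON list of objects with a `bbox_2d` field (`[x1, y1, x2, y2]`) and a `label` field, optionally wrapped in a Markdown code fence. Those field names and the helper `parse_detections` are assumptions for illustration; check the actual output of your prompt before relying on them.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in


def parse_detections(reply: str) -> list[dict]:
    """Extract well-formed detections from a model reply.

    Assumed (hypothetical) schema: a JSON list of objects with
    "bbox_2d": [x1, y1, x2, y2] and "label": str.
    """
    # Strip an optional fenced block around the JSON payload.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    payload = match.group(1) if match else reply

    items = json.loads(payload)
    if not isinstance(items, list):
        items = [items]  # tolerate a single object instead of a list

    detections = []
    for item in items:
        if not isinstance(item, dict):
            continue
        box = item.get("bbox_2d")
        # Keep only boxes with four numeric coordinates.
        if not (isinstance(box, list) and len(box) == 4
                and all(isinstance(v, (int, float)) for v in box)):
            continue
        x1, y1, x2, y2 = box
        # Drop degenerate boxes whose corners are out of order.
        if x1 < x2 and y1 < y2:
            detections.append({"label": item.get("label", ""),
                               "box": (x1, y1, x2, y2)})
    return detections
```

Malformed entries are skipped rather than raising, so one bad box does not discard the rest of the reply. A reply that is not valid JSON at all still raises `json.JSONDecodeError`, which the caller can handle.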
Sources
The underlying model files this model uses
Based on