README
Qwen2.5-VL-32B-Instruct is a vision-language model for understanding images, text, and video. It supports structured outputs, visual localization, and temporal reasoning, making it suitable for tasks such as object recognition, chart analysis, and extracting structured data from visual content.
The model is designed for applications in document analysis, event detection, and agentic tool use. Outputs include bounding boxes, points, and JSON-formatted structured data.
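As a rough illustration of consuming the JSON-formatted localization output, the sketch below parses a reply into typed boxes. It rests on assumptions not stated in this README: that the reply is a JSON list of objects with a `bbox_2d` key holding `[x1, y1, x2, y2]` and a `label` key, and that the JSON may arrive wrapped in a Markdown code fence. The names `Box`, `parse_boxes`, and `sample_reply` are hypothetical, and the sample reply is hand-written rather than real model output.

```python
import json
import re
from dataclasses import dataclass
from typing import List

# Three backticks, built indirectly so this example nests cleanly in Markdown.
FENCE = "`" * 3


@dataclass
class Box:
    """One detected region; the coordinate convention is an assumption."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def parse_boxes(reply: str) -> List[Box]:
    """Extract boxes from a reply that may wrap its JSON in a code fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply

    items = json.loads(payload)
    if isinstance(items, dict):
        # Tolerate a single object instead of a list.
        items = [items]

    boxes = []
    for item in items:
        coords = item.get("bbox_2d")  # assumed key name
        if not isinstance(coords, list) or len(coords) != 4:
            continue  # skip entries without a usable box
        x1, y1, x2, y2 = map(float, coords)
        boxes.append(Box(str(item.get("label", "")), x1, y1, x2, y2))
    return boxes


# Hand-written sample reply, not actual model output.
sample_reply = (
    FENCE + "json\n"
    + '[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}, {"label": "no box"}]'
    + "\n" + FENCE
)

if __name__ == "__main__":
    for box in parse_boxes(sample_reply):
        print(box.label, box.area)
```

Entries without a four-element `bbox_2d` list are skipped rather than raising, which keeps one malformed item from discarding the rest of the reply; stricter validation may suit production use better.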
Sources
The underlying model files this model uses
Based on