Qwen2.5-VL-3B-Instruct

Qwen2.5-VL-3B-Instruct is a vision-language model that understands images, text, and video. It supports structured outputs and visual localization, and it can process long videos with temporal reasoning. The model suits tasks such as object recognition, chart and layout analysis, and structured data extraction from visual content.

This model is designed for practical vision-language applications, including document analysis, event detection in video, and agentic tool use. Outputs can include bounding boxes, points, and JSON-formatted structured data.
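Because outputs can carry bounding boxes as JSON, downstream code usually needs to pull that JSON out of the reply text and validate it. The sketch below shows one way to do that using only the Python standard library. It assumes the prompt asked for a JSON array of objects with a `bbox_2d` key holding `[x1, y1, x2, y2]` pixel coordinates and an optional `label` key; those key names, along with the `Box` class and the `parse_boxes` helper, are illustrative assumptions and not a format this description guarantees.

```python
import json
from dataclasses import dataclass


@dataclass
class Box:
    """One detected region in pixel coordinates (hypothetical container)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(reply: str) -> list[Box]:
    """Extract bounding boxes from a reply that contains a JSON array.

    Assumes each array item looks like
    {"bbox_2d": [x1, y1, x2, y2], "label": "..."}; adjust the keys to
    match whatever format the prompt actually requested.
    """
    # The reply may wrap the JSON in prose or a code fence, so take the
    # text between the first "[" and the last "]".
    start = reply.find("[")
    end = reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    items = json.loads(reply[start:end + 1])

    boxes = []
    for item in items:
        x1, y1, x2, y2 = item["bbox_2d"]
        # Skip degenerate boxes with zero or negative width or height.
        if x2 <= x1 or y2 <= y1:
            continue
        boxes.append(Box(item.get("label", ""), x1, y1, x2, y2))
    return boxes


if __name__ == "__main__":
    sample = 'Detected: [{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]'
    for box in parse_boxes(sample):
        print(box)
```

Slicing between the outermost brackets keeps the parser tolerant of extra text around the JSON, at the cost of failing on replies that contain several separate arrays; a stricter schema check may be preferable in production.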

qwen2.5-vl-3b
