Model

qwen2.5-vl-3b

Public

Use cases

Vision Input

Minimum system memory

2GB

Tags

3B
qwen2vl

README

Qwen2.5-VL-3B-Instruct

Qwen2.5-VL-3B-Instruct is a vision-language model capable of understanding images, text, and video. It supports structured outputs, visual localization, and can process long videos with temporal reasoning. The model is suitable for tasks involving object recognition, chart and layout analysis, and extracting structured data from visual content.

This model is designed for practical vision-language applications, including document analysis, event detection in video, and agentic tool use. Outputs can include bounding boxes, points, and JSON-formatted structured data.

Sources

The underlying model files this model uses