← All Models

qwen2.5-vl

45.9K Downloads

Qwen2.5-VL is a performant vision-language model, capable of recognizing common objects and text. Supports context length of 128k tokens in a variety of human languages.

Models
Updated 2 days ago
2.15 GB
5.37 GB
19.35 GB
47.00 GB

Memory Requirements

To run the smallest qwen2.5-vl, you need at least 2 GB of RAM. The largest one may require up to 47 GB.

Capabilities

qwen2.5-vl models support vision input. They are available in gguf.

About qwen2.5-vl

undefined

Qwen2.5-VL is a vision-language model that supports context length of 128k tokens.

It is proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

Capable of acting as a visual agent that can reason and dynamically direct tools, which is capable of computer use and phone use.

Useful for generating structured outputs and stable JSON outputs.

Key Enhancements over Qwen2-VL:

  • Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

  • Being agentic: Qwen2.5-VL directly plays as a visual agent that can reason and dynamically direct tools, which is capable of computer use and phone use.

  • Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and this time it has a new ability of cpaturing event by pinpointing the relevant video segments.

  • Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.

  • Generating structured outputs: for data like scans of invoices, forms, tables, etc. Qwen2.5-VL supports structured outputs of their contents, benefiting usages in finance, commerce, etc.

Model architecture

undefined

Performance

Image benchmark

BenchmarkInternVL2.5-8BMiniCPM-o 2.6GPT-4o-miniQwen2-VL-7BQwen2.5-VL-7B
MMMUval5650.46054.158.6
MMMU-Proval34.3-37.630.541.0
DocVQAtest9393-94.595.7
InfoVQAtest77.6--76.582.6
ChartQAtest84.8--83.087.3
TextVQAval79.180.1-84.384.9
OCRBench822852785845864
CC_OCR57.761.677.8
MMStar62.860.763.9
MMBench-V1.1-Entest79.478.076.080.782.6
MMT-Benchtest---63.763.6
MMStar61.557.554.860.763.9
MMVetGPT-4-Turbo54.260.066.962.067.1
HallBenchavg45.248.146.150.652.9
MathVistatestmini58.360.652.458.268.2
MathVision---16.325.07

Agent benchmark

BenchmarksQwen2.5-VL-7B
ScreenSpot84.7
ScreenSpot Pro29.0
AITZ_EM81.9
Android Control High_EM60.1
Android Control Low_EM93.7
AndroidWorld_SR25.5
MobileMiniWob++_SR91.4