The olmOCR 2 model is a Vision Language Model (VLM) from Allen AI.
To run the smallest olmOCR 2, you need at least 5 GB of RAM.
olmOCR 2 models support vision input and are available in GGUF format.

olmOCR 2 is a fine-tune of Qwen2.5-VL-7B-Instruct (by Alibaba Qwen), trained using the olmOCR-mix-1025 dataset.
It has been trained on a highly curated set of academic papers, technical documentation, and other reference content. The model was fine-tuned on English documents on top of a multilingual base VLM, so other languages may work.
This model expects as input a single document image, rendered so that its longest dimension is 1288 pixels. The prompt must also include metadata extracted from the document, and the easiest way to generate it is with the helpers provided by the olmOCR toolkit, as in the sketch below.
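A minimal sketch of one way to prepare and run a single page, assuming the olmOCR toolkit (`pip install olmocr`), a Hugging Face repo id of `allenai/olmOCR-2-7B-1025` (inferred from the model name on this page), and a hypothetical local `paper.pdf`. The prompt-building helpers below are the ones shipped with earlier olmOCR releases and may differ in newer toolkit versions:

```python
import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Render page 1 of a local PDF (hypothetical "paper.pdf") so that its
# longest dimension is 1288 pixels, as the model expects.
image_base64 = render_pdf_to_base64png("paper.pdf", 1, target_longest_image_dim=1288)

# Extract document metadata ("anchor text") and build the prompt from it.
anchor_text = get_anchor_text("paper.pdf", 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]

# Repo id is an assumption based on the model name given in the table below.
model_id = "allenai/olmOCR-2-7B-1025"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open(BytesIO(base64.b64decode(image_base64)))
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")

# Decode only the newly generated tokens (the OCR output).
output = model.generate(**inputs, max_new_tokens=4096)
new_tokens = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```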
When used with the olmOCR toolkit, which automatically renders, rotates, and retries pages as needed, the model achieves the following scores on olmOCR-bench (a pipeline invocation sketch follows the table).
| Model | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall |
|---|---|---|---|---|---|---|---|---|---|
| olmOCR pipeline v0.4.0 with olmOCR-2-7B-1025 | 82.9 | 82.1 | 84.3 | 48.3 | 95.7 | 84.3 | 81.4 | 99.7 | 82.3 ± 1.1 |
| olmOCR pipeline v0.4.0 with olmOCR-2-7B-1025-FP8 | 83.0 | 82.3 | 84.9 | 47.7 | 96.1 | 83.7 | 81.9 | 99.7 | 82.4 ± 1.1 |
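For batch conversion like the benchmark runs above, the toolkit's pipeline handles rendering, rotation, and retries for you. A hedged sketch of invoking it from Python, assuming the `python -m olmocr.pipeline` CLI documented in the olmOCR repository and the same hypothetical `paper.pdf`:

```python
import subprocess

# Invoke the olmOCR pipeline CLI (documented in the olmOCR repository).
# It renders, rotates, and retries pages, writing results into the workspace.
subprocess.run(
    [
        "python", "-m", "olmocr.pipeline",
        "./localworkspace",     # workspace directory for intermediate results
        "--markdown",           # also emit per-document markdown files
        "--pdfs", "paper.pdf",  # hypothetical input document
    ],
    check=True,
)
```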
This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.