The olmOCR 2 model is fine-tuned from Qwen2.5-VL-7B-Instruct using the
olmOCR-mix-1025 dataset.
It has been trained on a highly curated set of academic papers, technical documentation, and other reference content. The model was fine-tuned on English documents using a multilingual base VLM, so other languages may work but are not officially supported.
This model expects a single document image as input, rendered so that its longest dimension is 1288 pixels. The prompt must also contain additional metadata from the document; the easiest way to generate this is with the methods provided by the olmOCR toolkit.
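As a rough illustration of the rendering requirement above, the snippet below scales an image so its longest side is exactly 1288 pixels. This is a minimal sketch using Pillow, not the olmOCR toolkit's actual rendering code; the helper name and the synthetic test image are assumptions for the example.

```python
# Sketch: render a document image so its longest side is 1288 px,
# matching what olmOCR 2 expects as input.
# NOTE: this is an illustrative helper, not the olmOCR toolkit API.
from PIL import Image

TARGET_LONGEST_SIDE = 1288

def resize_longest_side(img: Image.Image, target: int = TARGET_LONGEST_SIDE) -> Image.Image:
    """Scale the image so its longest dimension equals `target` pixels."""
    scale = target / max(img.size)
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)

# Example with a synthetic page image (in real use, a rendered PDF page).
page = Image.new("RGB", (1700, 2200), "white")
resized = resize_longest_side(page)
print(resized.size)  # longest side is now 1288
```

In practice, the olmOCR toolkit handles this rendering step (along with extracting the document metadata for the prompt), so most users should not need to reimplement it.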
Parameters
Custom configuration options included with this model