olmOCR 2: Unit Test Rewards for Document OCR
October 22, 2025
Authors: Jake Poznanski, Luca Soldaini, Kyle Lo
cs.AI
Abstract
We present olmOCR 2, the latest in our family of powerful OCR systems for
converting digitized print documents, like PDFs, into clean, naturally ordered
plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision
language model (VLM) trained using reinforcement learning with verifiable
rewards (RLVR), where our rewards are a diverse set of binary unit tests. To
scale unit test creation, we develop a pipeline for generating synthetic
documents with diverse and challenging layouts, known ground-truth HTML source
code, and extracted test cases. We show that RL training on these test cases
results in state-of-the-art performance on olmOCR-Bench, our English-language
OCR benchmark, with the largest improvements in math formula conversion, table
parsing, and multi-column layouts compared to previous versions. We release our
model, data and code under permissive open licenses.
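
To make the idea of binary unit tests as rewards concrete, the following is a minimal sketch of how pass/fail checks on OCR output could be aggregated into a scalar reward. The test types (`text_present`, `text_order`), the `UnitTest` class, and the averaging rule are all assumptions chosen for illustration; the abstract does not specify how olmOCR 2 defines or combines its tests.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UnitTest:
    """Hypothetical binary test: a named predicate over OCR output text."""
    name: str
    check: Callable[[str], bool]


def text_present(snippet: str) -> UnitTest:
    # Passes when the expected snippet appears somewhere in the output.
    return UnitTest(f"present:{snippet}", lambda out: snippet in out)


def text_order(first: str, second: str) -> UnitTest:
    # Passes when both snippets appear and `first` precedes `second`,
    # a rough proxy for correct reading order.
    def check(out: str) -> bool:
        i, j = out.find(first), out.find(second)
        return i != -1 and j != -1 and i < j

    return UnitTest(f"order:{first}<{second}", check)


def reward(output: str, tests: List[UnitTest]) -> float:
    # Assumed aggregation: fraction of binary tests that pass.
    if not tests:
        return 0.0
    passed = sum(1 for t in tests if t.check(output))
    return passed / len(tests)


if __name__ == "__main__":
    tests = [
        text_present("Introduction"),
        text_order("Introduction", "Methods"),
        text_present("Table 1"),
    ]
    print(reward("Introduction ... Methods", tests))
```

Because each check is deterministic and binary, a reward of this shape can be verified without a learned judge, which is the property that RLVR relies on.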