

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

January 20, 2026
Authors: Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin
cs.AI

Abstract

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
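
The abstract mentions RLVR with IoU-based rewards for the predicted image bounding boxes but does not spell out the reward function. Below is a minimal sketch of how such a reward could be computed, assuming normalized [x0, y0, x1, y1] boxes in [0, 1] and greedy one-to-one matching between predicted and reference boxes; the function names and matching scheme are illustrative assumptions, not the authors' exact recipe.

```python
# Sketch of an IoU-based reward for bounding-box predictions (illustrative;
# the paper's actual reward shaping is not described in the abstract).

def iou(a, b):
    """Intersection-over-Union of two normalized boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def bbox_reward(pred_boxes, gold_boxes):
    """Greedy matching; unmatched predictions or references contribute zero."""
    if not gold_boxes:
        return 1.0 if not pred_boxes else 0.0
    remaining = list(pred_boxes)
    scores = []
    for g in gold_boxes:
        if not remaining:
            scores.append(0.0)
            continue
        best = max(remaining, key=lambda p: iou(p, g))
        scores.append(iou(best, g))
        remaining.remove(best)
    scores.extend(0.0 for _ in remaining)  # penalize spurious predictions
    return sum(scores) / len(scores)
```

In an RLVR setup, a scalar like this would serve as the verifiable reward for each sampled output during the policy-optimization stage; the specific policy-gradient algorithm used by the authors is not stated here.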
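
The robustness step combines checkpoint averaging with task-arithmetic merging. A minimal PyTorch-style sketch is shown below, assuming checkpoints share identical state-dict keys; the uniform average and the merging weights are illustrative assumptions, as the abstract gives no coefficients.

```python
import torch

def average_checkpoints(state_dicts):
    """Uniform checkpoint averaging: element-wise mean of parameter tensors."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def task_arithmetic_merge(base, task_models, weights):
    """Task arithmetic: add weighted task vectors (task - base) onto the base model."""
    merged = {k: v.clone().float() for k, v in base.items()}
    for sd, w in zip(task_models, weights):
        for k in merged:
            merged[k] += w * (sd[k].float() - base[k].float())
    return merged
```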