

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

January 20, 2026
Authors: Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin
cs.AI

Abstract

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
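
The abstract mentions RLVR with IoU-based rewards for the predicted image bounding boxes but does not spell out the reward function. Below is a minimal sketch of how such a reward could be computed, assuming normalized [x0, y0, x1, y1] boxes in [0, 1] and greedy one-to-one matching between predicted and reference boxes; the function names and matching scheme are illustrative assumptions, not the authors' exact recipe.

```python
# Sketch of an IoU-based reward for bounding-box predictions (illustrative;
# the paper's actual reward shaping is not described in the abstract).

def iou(a, b):
    """Intersection-over-Union of two normalized boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def bbox_reward(pred_boxes, gold_boxes):
    """Greedy matching; unmatched predictions or references contribute zero."""
    if not gold_boxes:
        return 1.0 if not pred_boxes else 0.0
    remaining = list(pred_boxes)
    scores = []
    for g in gold_boxes:
        if not remaining:
            scores.append(0.0)
            continue
        best = max(remaining, key=lambda p: iou(p, g))
        scores.append(iou(best, g))
        remaining.remove(best)
    scores.extend(0.0 for _ in remaining)  # penalize spurious predictions
    return sum(scores) / len(scores)
```

In an RLVR setup, a scalar like this would serve as the verifiable reward for each sampled output during the policy-optimization stage; the specific policy-gradient algorithm used by the authors is not stated here.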
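
The robustness step combines checkpoint averaging with task-arithmetic merging. A minimal PyTorch-style sketch is shown below, assuming checkpoints share identical state-dict keys; the uniform average and the merging weights are illustrative assumptions, as the abstract gives no coefficients.

```python
import torch

def average_checkpoints(state_dicts):
    """Uniform checkpoint averaging: element-wise mean of parameter tensors."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def task_arithmetic_merge(base, task_models, weights):
    """Task arithmetic: add weighted task vectors (task - base) onto the base model."""
    merged = {k: v.clone().float() for k, v in base.items()}
    for sd, w in zip(task_models, weights):
        for k in merged:
            merged[k] += w * (sd[k].float() - base[k].float())
    return merged
```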