

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

January 20, 2026
Authors: Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin
cs.AI

Abstract

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with reinforcement learning with verifiable rewards (RLVR) using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and the LightOnOCR-bbox-bench evaluation under their respective licenses.
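The IoU-based reward mentioned in the abstract can be illustrated with a short sketch. The snippet below assumes boxes are axis-aligned, given as normalized (x0, y0, x1, y1) corners in [0, 1], and uses a simple greedy matching between predicted and reference boxes; the paper's actual box encoding, matching scheme, and reward shaping are not specified in the abstract, so treat this purely as an illustration.

```python
# Illustrative IoU-based reward for normalized bounding boxes (not the
# authors' implementation). Boxes are (x0, y0, x1, y1) with values in [0, 1].

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned normalized boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = max(0.0, ax1 - ax0) * max(0.0, ay1 - ay0)
    area_b = max(0.0, bx1 - bx0) * max(0.0, by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def bbox_reward(predicted, reference):
    """Greedy one-to-one matching: mean best IoU over reference boxes."""
    if not reference:
        return 1.0 if not predicted else 0.0
    remaining = list(predicted)
    scores = []
    for ref in reference:
        if not remaining:
            scores.append(0.0)
            continue
        best = max(remaining, key=lambda box: iou(box, ref))
        scores.append(iou(best, ref))
        remaining.remove(best)
    return sum(scores) / len(reference)
```

Similarly, checkpoint averaging and task-arithmetic merging over model weights can be sketched as element-wise operations on state dicts; the merging coefficient `alpha` below is an illustrative placeholder, not a value reported by the authors.

```python
# Minimal sketch of checkpoint averaging and task-arithmetic merging over
# PyTorch state dicts with identical keys (hyperparameters are assumptions).
import torch


def average_checkpoints(state_dicts):
    """Element-wise mean of several checkpoints."""
    return {
        k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
        for k in state_dicts[0]
    }


def task_arithmetic_merge(base, finetuned, alpha=0.5):
    """Add a scaled task vector (finetuned - base) onto the base weights."""
    return {
        k: base[k].float() + alpha * (finetuned[k].float() - base[k].float())
        for k in base
    }
```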