TexOCR: Advancing Document OCR Models Toward Compilable Page-to-LaTeX Reconstruction

Abstract

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
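To make "referential integrity" concrete, the following is a minimal sketch of one check that a LaTeX unit test of this kind might perform: every key used in a reference command should have a matching `\label` in the same output. This is an illustration under stated assumptions, not the paper's actual test suite or reward. The function names `dangling_refs` and `referential_reward`, the set of reference commands covered, and the binary 0/1 reward are all hypothetical.

```python
import re

# Assumption: labels and references use the standard brace syntax.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
# Assumption: this list of reference commands is illustrative, not exhaustive.
REF_RE = re.compile(r"\\(?:ref|eqref|autoref|cref|Cref|pageref)\{([^}]*)\}")


def dangling_refs(tex: str) -> set[str]:
    """Return reference keys that have no matching \\label in `tex`."""
    labels = set(LABEL_RE.findall(tex))
    refs = set()
    for group in REF_RE.findall(tex):
        # cleveref-style commands accept comma-separated keys.
        refs.update(key.strip() for key in group.split(",") if key.strip())
    return refs - labels


def referential_reward(tex: str) -> float:
    """Hypothetical binary reward: 1.0 if every reference resolves, else 0.0."""
    return 1.0 if not dangling_refs(tex) else 0.0


if __name__ == "__main__":
    page = (
        r"\section{Intro}\label{sec:intro} "
        r"See Section~\ref{sec:intro} and Eq.~\eqref{eq:loss}."
    )
    print(dangling_refs(page))
    print(referential_reward(page))
```

A check like this is cheap and deterministic, which is what makes it usable as a verifiable reward signal. It ignores complications a real test would have to handle, such as labels defined on other pages, commented-out source, and user-defined reference macros. A compilability check would additionally need to invoke a LaTeX engine.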