POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Abstract

High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
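
The second-stage loop described in the abstract (annotate real documents, filter the annotations, retrain, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `self_improve`, its parameters `keep` and `retrain`, and the default number of rounds are all hypothetical names and values chosen for the example, and the abstract does not specify the actual filtering strategies or training procedure.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type aliases for illustration only.
Document = str                      # stand-in for a page image or file path
Annotation = str                    # stand-in for the extracted text output
Model = Callable[[Document], Annotation]
Pair = Tuple[Document, Annotation]


def self_improve(
    model: Model,
    real_docs: Iterable[Document],
    keep: Callable[[Document, Annotation], bool],
    retrain: Callable[[List[Pair]], Model],
    rounds: int = 3,
) -> Tuple[Model, List[Pair]]:
    """Iteratively annotate, filter, and retrain (assumed structure).

    `keep` stands in for the paper's filtering strategies and `retrain`
    for its training step; both are supplied by the caller.
    """
    docs = list(real_docs)
    verified: List[Pair] = []
    for _ in range(rounds):
        # 1. Annotate real documents with the current model.
        labeled = [(doc, model(doc)) for doc in docs]
        # 2. Keep only annotations that pass the quality filters.
        verified = [(doc, ann) for doc, ann in labeled if keep(doc, ann)]
        if not verified:
            break  # nothing passed the filters, so there is nothing to train on
        # 3. Retrain on the verified pairs to get the next round's model.
        model = retrain(verified)
    return model, verified
```

Passing the filter and the training step in as callables keeps the sketch independent of any particular model or rule set. In practice, `model` would wrap a vision-language model and `keep` would combine several checks.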
