POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Abstract

High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling with existing models often lacks accuracy in these challenging scenarios. Consequently, training student models by distilling the outputs of teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
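The second-stage loop described in the abstract (annotate real documents, filter the annotations, retrain, repeat) can be sketched roughly as follows. This is only a minimal illustration, not the authors' implementation: the function name `self_improve`, the three callables `annotate`, `passes_filters`, and `retrain`, and the fixed `rounds` count are all hypothetical placeholders, and the abstract does not say which filtering strategies are used or whether each round retrains from the current model or from the original checkpoint.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Model = TypeVar("Model")
Doc = TypeVar("Doc")


def self_improve(
    model: Model,
    documents: Iterable[Doc],
    annotate: Callable[[Model, Doc], str],
    passes_filters: Callable[[Doc, str], bool],
    retrain: Callable[[Model, List[Tuple[Doc, str]]], Model],
    rounds: int = 3,
) -> Tuple[Model, List[Tuple[Doc, str]]]:
    """Repeat annotate -> filter -> retrain for a fixed number of rounds.

    All callables are hypothetical stand-ins supplied by the caller;
    nothing here reflects the actual POINTS-Reader code.
    """
    docs = list(documents)
    verified: List[Tuple[Doc, str]] = []
    for _ in range(rounds):
        # 1. Annotate every real document with the current model.
        labeled = [(doc, annotate(model, doc)) for doc in docs]
        # 2. Keep only the annotations that pass the quality checks.
        verified = [(doc, text) for doc, text in labeled if passes_filters(doc, text)]
        # 3. Retrain on the verified pairs (assumption: continue from the
        #    current model rather than restarting from the base checkpoint).
        model = retrain(model, verified)
    return model, verified
```

The function returns both the final model and the last verified dataset, mirroring the abstract's claim that the model and the generated data improve together across iterations.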
