POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
September 1, 2025
Authors: Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou
cs.AI
Abstract
High-quality labeled data is essential for training accurate document
conversion models, particularly in domains with complex formats such as tables,
formulas, and multi-column text. However, manual annotation is both costly and
time-consuming, while automatic labeling using existing models often lacks
accuracy in handling such challenging scenarios. Consequently, training student
models by distilling outputs from teacher models can significantly limit their
performance in real-world applications. In this paper, we propose a fully
automated, distillation-free framework comprising two stages for constructing
high-quality document extraction datasets and models capable of handling
diverse document formats and layouts. In the first stage, we introduce a method
for generating large-scale, diverse synthetic data, which enables a model to
extract key elements in a unified format with strong initial performance. In
the second stage, we present a self-improvement approach that further adapts
the model, initially trained on synthetic data, to real-world documents.
Specifically, we first use the fine-tuned model to annotate real documents,
then apply a suite of filtering strategies to verify annotation quality, and
finally retrain the model on the verified dataset. By iteratively repeating
this process, we progressively enhance both the model's conversion capabilities
and the quality of the generated data. We train a public POINTS-1.5 model to
obtain POINTS-Reader, which surpasses many existing public and proprietary
models of comparable or larger size. Our model is available at
https://github.com/Tencent/POINTS-Reader.
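
The two-stage procedure described in the abstract can be pictured as a simple loop: train on synthetic data, then repeatedly annotate real documents, filter the annotations, and retrain. The following is a minimal sketch of that control flow only, not the authors' implementation. Every name in it (`self_improve`, `train`, `annotate`, `filters`, `rounds`) is hypothetical, the concrete filtering strategies are not specified in the abstract and are left as caller-supplied predicates, and continuing training from the current model in each round (rather than restarting from a base checkpoint) is an assumption.

```python
from typing import Any, Callable, Iterable, List, Tuple

Document = str    # placeholder for a document image or file path
Annotation = str  # placeholder for extracted text in the unified output format
Example = Tuple[Document, Annotation]


def self_improve(
    model: Any,
    synthetic_data: List[Example],
    real_documents: Iterable[Document],
    train: Callable[[Any, List[Example]], Any],
    annotate: Callable[[Any, Document], Annotation],
    filters: List[Callable[[Document, Annotation], bool]],
    rounds: int,
) -> Tuple[Any, List[Example]]:
    """Hypothetical sketch of the annotate -> filter -> retrain loop."""
    # Stage 1: fit the model on synthetic data to get a strong starting point.
    model = train(model, synthetic_data)

    documents = list(real_documents)
    verified: List[Example] = []

    # Stage 2: iterate so that the model and the dataset improve together.
    for _ in range(rounds):
        # Annotate real documents with the current model.
        labeled = [(doc, annotate(model, doc)) for doc in documents]
        # Keep only the annotations that pass every filter.
        verified = [
            (doc, ann) for doc, ann in labeled
            if all(check(doc, ann) for check in filters)
        ]
        # Retrain on the verified subset (assumption: continue from the
        # current model instead of restarting from a base checkpoint).
        model = train(model, verified)

    return model, verified
```

The function takes the training, annotation, and filtering steps as arguments so that the sketch stays independent of any particular model or library; in practice these would wrap the actual fine-tuning and inference code.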