POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Abstract

High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is costly and time-consuming, while automatic labeling with existing models often lacks accuracy in these challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free, two-stage framework for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
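The second-stage loop described in the abstract (annotate real documents, filter the annotations, retrain, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `self_improve`, `toy_model`, `non_empty`, and `toy_retrain` are hypothetical, documents and annotations are represented as plain strings, and the single filter shown is a placeholder, since the abstract does not specify the actual filtering strategies.

```python
from typing import Callable, List, Tuple

Document = str    # assumption: stand-in for a real page image
Annotation = str  # assumption: stand-in for the model's extracted text
Model = Callable[[Document], Annotation]
Pair = Tuple[Document, Annotation]


def self_improve(
    model: Model,
    real_docs: List[Document],
    filters: List[Callable[[Document, Annotation], bool]],
    retrain: Callable[[List[Pair]], Model],
    rounds: int,
) -> Tuple[Model, List[Pair]]:
    """Iterate annotate -> filter -> retrain for a fixed number of rounds."""
    verified: List[Pair] = []
    for _ in range(rounds):
        # Step 1: the current model annotates every real document.
        candidates = [(doc, model(doc)) for doc in real_docs]
        # Step 2: keep only annotations that pass every filter.
        verified = [
            (doc, ann)
            for doc, ann in candidates
            if all(check(doc, ann) for check in filters)
        ]
        # Step 3: retrain on the verified pairs (skipped if none survive).
        if verified:
            model = retrain(verified)
    return model, verified


# Hypothetical toy components, used only to make the sketch runnable.
def toy_model(doc: Document) -> Annotation:
    return doc.strip()


def non_empty(doc: Document, ann: Annotation) -> bool:
    # Placeholder filter: reject empty annotations.
    return len(ann) > 0


def toy_retrain(pairs: List[Pair]) -> Model:
    # "Retraining" here just memorizes the verified pairs.
    table = dict(pairs)
    return lambda doc: table.get(doc, "")
```

Calling `self_improve(toy_model, docs, [non_empty], toy_retrain, rounds=2)` returns the model from the last retraining step together with the verified pairs from the final round. In a real setting the retraining step would fine-tune the vision-language model, and the filters would check annotation quality in whatever way the full paper specifies.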
