Noise-Aware Training of Layout-Aware Language Models

Abstract

A visually rich document (VRD) uses visual features along with linguistic cues to convey information. Training a custom extractor that identifies named entities in a document requires a large number of instances of the target document type, annotated in both the textual and visual modalities. This is an expensive bottleneck in enterprise scenarios, where we want to train custom extractors for thousands of different document types in a scalable way. Pre-training an extractor model on unlabeled instances of the target document type, followed by fine-tuning on human-labeled instances, does not work in these scenarios because it exceeds the maximum training time allocated for the extractor. In this paper, we address this scenario by proposing a Noise-Aware Training method, NAT. Instead of acquiring expensive human-labeled documents, NAT uses weakly labeled documents to train an extractor in a scalable way. To avoid degrading the model's quality with noisy, weakly labeled samples, NAT estimates the confidence of each training sample and incorporates it as an uncertainty measure during training. We train multiple state-of-the-art extractor models using NAT. Experiments on a number of publicly available and in-house datasets show that NAT-trained models are not only robust, outperforming a transfer-learning baseline by up to 6% in macro-F1 score, but also more label-efficient, reducing the human effort required to obtain comparable performance by up to 73%.
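
The abstract states that NAT estimates a confidence for each training sample and uses it as an uncertainty measure during training, but it does not say how the confidence enters the objective. As an illustration only, the sketch below assumes one common choice: scaling each sample's cross-entropy loss by a confidence value in [0, 1]. The function names `log_sum_exp` and `confidence_weighted_loss`, and the weighting scheme itself, are hypothetical and are not taken from the paper.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def confidence_weighted_loss(batch_logits, weak_labels, confidences):
    """Confidence-weighted mean cross-entropy (hypothetical weighting scheme).

    batch_logits: list of per-sample logit lists, one score per class
    weak_labels:  list of (possibly noisy) class indices
    confidences:  list of per-sample confidence estimates in [0, 1]
    """
    if not (len(batch_logits) == len(weak_labels) == len(confidences)):
        raise ValueError("all inputs must have the same length")

    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label, conf in zip(batch_logits, weak_labels, confidences):
        # Negative log-likelihood of the weak label under a softmax.
        nll = log_sum_exp(logits) - logits[label]
        weighted_sum += conf * nll
        weight_total += conf

    # Normalize by the total confidence so the loss scale stays comparable.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Two samples with identical logits; the second weak label disagrees
# with the model and is assumed to have low estimated confidence.
logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]
uniform = confidence_weighted_loss(logits, labels, [1.0, 1.0])
weighted = confidence_weighted_loss(logits, labels, [1.0, 0.1])
```

Under this assumption, a weak label with low estimated confidence contributes less to the loss, so `weighted` comes out smaller than `uniform` in the example above.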
