FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

Abstract

The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy that unifies self-supervised pre-training for all modalities in a single loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay among all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
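Two ideas from the abstract can be illustrated with a minimal Python sketch: the bounding box that joins two tokens connected by a graph edge, and a contrastive objective that rewards agreement between two representations of the same node. The abstract does not give the exact loss or any implementation details, so everything here is an assumption for illustration only: the function names (`union_box`, `edge_regions`, `contrastive_loss`), the `(x0, y0, x1, y1)` box format, the InfoNCE-style loss, and the temperature value are all hypothetical and not taken from FormNetV2 itself.

```python
import math


def union_box(box_a, box_b):
    """Smallest axis-aligned box covering both token boxes.

    Boxes are assumed to be (x0, y0, x1, y1) tuples; this format is an
    illustrative assumption, not something the abstract specifies.
    """
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def edge_regions(token_boxes, edges):
    """Map each graph edge (i, j) to the region joining tokens i and j."""
    return {(i, j): union_box(token_boxes[i], token_boxes[j]) for i, j in edges}


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's objective).

    Node i in view_a should be most similar to node i in view_b,
    compared with every other node in view_b.
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


if __name__ == "__main__":
    boxes = [(10, 20, 50, 30), (60, 22, 90, 34), (10, 40, 40, 50)]
    print(edge_regions(boxes, [(0, 1), (0, 2)]))

    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In this sketch the loss is small when matching nodes agree across the two views and large when they do not, which is the sense in which a contrastive objective "maximizes agreement". In a real model, the region returned by `union_box` would be used to crop or pool image features for that edge.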
