FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction
May 4, 2023
Authors: Chen-Yu Lee, Chun-Liang Li, Hao Zhang, Timothy Dozat, Vincent Perot, Guolong Su, Xiang Zhang, Kihyuk Sohn, Nikolai Glushnev, Renshen Wang, Joshua Ainslie, Shangbang Long, Siyang Qin, Yasuhisa Fujii, Nan Hua, Tomas Pfister
cs.AI
Abstract
The recent advent of self-supervised pre-training techniques has led to a
surge in the use of multimodal learning in form document understanding.
However, existing approaches that extend masked language modeling to other
modalities require careful multi-task tuning, complex reconstruction target
designs, or additional pre-training data. In FormNetV2, we introduce a
centralized multimodal graph contrastive learning strategy to unify
self-supervised pre-training for all modalities in one loss. The graph
contrastive objective maximizes the agreement of multimodal representations,
providing a natural interplay for all modalities without special customization.
In addition, we extract image features within the bounding box that joins a
pair of tokens connected by a graph edge, capturing more targeted visual cues
without loading a sophisticated and separately pre-trained image embedder.
FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE
and Payment benchmarks with a more compact model size.
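
To make the graph contrastive objective concrete, here is a minimal sketch assuming a SimCLR-style NT-Xent loss applied to node embeddings produced from two corrupted views of the multimodal graph; the function and variable names are illustrative, not taken from the paper's code, and the paper's exact loss formulation may differ.

```python
# Hedged sketch: contrastive agreement between two views of graph nodes.
# z_a, z_b: [N, D] embeddings of the same N nodes under two corruptions.
import torch
import torch.nn.functional as F

def graph_contrastive_loss(z_a: torch.Tensor,
                           z_b: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent: each node in view A should agree with the same node in
    view B (diagonal positives) and disagree with every other node."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # [N, N] similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetrize over both matching directions (A->B and B->A).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

A single loss of this form covers all modalities at once, since each node embedding already fuses text, layout, and image features before the agreement is measured.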
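The edge-level image features can likewise be pictured as pooling from the union of the two connected tokens' boxes. Below is a minimal sketch assuming axis-aligned (x0, y0, x1, y1) boxes already in feature-map coordinates and torchvision's `roi_align` for the pooling step; the paper's actual convolutional extractor and coordinate conventions may differ.

```python
# Hedged sketch: visual cue for a graph edge = features pooled from the
# joint bounding box of the two tokens the edge connects.
from typing import List, Tuple
import torch
from torchvision.ops import roi_align

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def union_box(box_u: Box, box_v: Box) -> Box:
    """Smallest box covering both token boxes."""
    return (min(box_u[0], box_v[0]), min(box_u[1], box_v[1]),
            max(box_u[2], box_v[2]), max(box_u[3], box_v[3]))

def edge_image_features(feature_map: torch.Tensor,
                        edges: List[Tuple[Box, Box]],
                        out_size: int = 3) -> torch.Tensor:
    """feature_map: [1, C, H, W] activations from a small conv net over
    the document image. Returns one [C, out_size, out_size] pooled
    feature per edge, without any separately pre-trained image embedder."""
    boxes = torch.tensor([union_box(u, v) for u, v in edges],
                         dtype=torch.float32)
    return roi_align(feature_map, [boxes], output_size=out_size)
```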