FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction
May 4, 2023
Authors: Chen-Yu Lee, Chun-Liang Li, Hao Zhang, Timothy Dozat, Vincent Perot, Guolong Su, Xiang Zhang, Kihyuk Sohn, Nikolai Glushnev, Renshen Wang, Joshua Ainslie, Shangbang Long, Siyang Qin, Yasuhisa Fujii, Nan Hua, Tomas Pfister
cs.AI
Abstract
The recent advent of self-supervised pre-training techniques has led to a
surge in the use of multimodal learning in form document understanding.
However, existing approaches that extend masked language modeling to other
modalities require careful multi-task tuning, complex reconstruction target
designs, or additional pre-training data. In FormNetV2, we introduce a
centralized multimodal graph contrastive learning strategy to unify
self-supervised pre-training for all modalities in one loss. The graph
contrastive objective maximizes the agreement of multimodal representations,
providing a natural interplay for all modalities without special customization.
In addition, we extract image features within the bounding box that joins a
pair of tokens connected by a graph edge, capturing more targeted visual cues
without loading a sophisticated and separately pre-trained image embedder.
FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE
and Payment benchmarks with a more compact model size.
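To make the two ideas in the abstract concrete, here is a minimal sketch of (a) the union bounding box spanning a pair of tokens joined by a graph edge, from which targeted image features are pooled, and (b) an NT-Xent-style contrastive loss that maximizes agreement between two views of the same graph. The helper names (`union_box`, `nt_xent_loss`) and this exact loss formulation are illustrative assumptions, not the paper's implementation.

```python
import math

def union_box(box_a, box_b):
    """Union bounding box covering two tokens joined by a graph edge.
    Boxes are (x0, y0, x1, y1); image features would be pooled from this
    region. Illustrative helper, not the paper's implementation."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent_loss(view1, view2, tau=0.5):
    """NT-Xent (InfoNCE-style) loss over two corrupted views of a graph:
    each node embedding in view1 is pulled toward its counterpart in view2
    and pushed away from all other embeddings. A common contrastive
    formulation; the exact FormNetV2 objective may differ."""
    n = len(view1)
    z = view1 + view2  # pooled candidates from both views
    total = 0.0
    for i in range(n):
        pos = math.exp(_cosine(view1[i], view2[i]) / tau)
        denom = sum(math.exp(_cosine(view1[i], z[j]) / tau)
                    for j in range(2 * n) if z[j] is not view1[i])
        total += -math.log(pos / denom)
    return total / n
```

Matched views (each node agreeing with its counterpart) yield a lower loss than mismatched ones, which is exactly the agreement the graph contrastive objective maximizes.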