I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking

August 4, 2025
Authors: Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, Jingping Liu
cs.AI

Abstract

Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges: the unnecessary incorporation of image data in certain scenarios, and a reliance on a single, one-time extraction of visual features, both of which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections (I2CR). The framework prioritizes textual information when addressing the task. When text alone is insufficient to link the correct entity, as judged by intra- and inter-modal evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and improve matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods, achieving improvements of 3.2%, 5.1%, and 1.6% on the respective datasets. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
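
To make the text-first, reflect-with-vision control flow described above more concrete, the sketch below illustrates one way it could be wired together. It is an illustration under stated assumptions, not the authors' released implementation: all callables (text_linker, intra_check, inter_check, clue_extractor, multimodal_linker) and the max_rounds budget are hypothetical placeholders standing in for the underlying LLM/VLM prompts.

```python
from typing import Callable, List

# Hedged sketch of the control flow the abstract describes: answer from text first,
# and only fall back to multi-round visual-clue reflection when the intra-/inter-modal
# checks reject the text-only answer. Every callable here is a hypothetical placeholder
# for an LLM/VLM call, not the authors' actual API.
def link_entity(
    mention_text: str,
    image: object,
    candidates: List[str],
    text_linker: Callable[[str, List[str]], str],       # text-only linking
    intra_check: Callable[[str, str], bool],             # answer consistent with the text?
    inter_check: Callable[[str, object], bool],          # answer consistent with the image?
    clue_extractor: Callable[[object, int], str],        # one visual aspect per round
    multimodal_linker: Callable[[str, List[str], List[str]], str],
    max_rounds: int = 3,                                  # assumed iteration budget
) -> str:
    # Round 0: try to resolve the mention from text alone.
    entity = text_linker(mention_text, candidates)
    if intra_check(entity, mention_text) and inter_check(entity, image):
        return entity

    # Reflection rounds: add one new visual clue per round and re-link,
    # stopping early once both checks pass.
    clues: List[str] = []
    for round_id in range(max_rounds):
        clues.append(clue_extractor(image, round_id))
        entity = multimodal_linker(mention_text, clues, candidates)
        if intra_check(entity, mention_text) and inter_check(entity, image):
            break
    return entity
```

In this reading, the intra-modal check acts as a consistency test against the textual context and the inter-modal check guards against answers that contradict the image; both are assumptions about how the paper's intra- and inter-modal evaluations might be operationalized.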