I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking

August 4, 2025
Authors: Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, Jingping Liu
cs.AI

Abstract

Multimodal entity linking plays a crucial role in a wide range of applications. Large language model-based methods have recently become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges: the unnecessary incorporation of image data in certain scenarios, and reliance on only a one-time extraction of visual features, both of which can undermine effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for multimodal entity linking, called Intra- and Inter-modal Collaborative Reflections (I2CR). The framework prioritizes textual information to resolve the task. When text alone is insufficient to link the correct entity, as judged by intra- and inter-modal evaluations, it employs a multi-round iterative strategy that integrates key visual clues from different aspects of the image to support reasoning and improve matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
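
The abstract describes a text-first pipeline in which the image is consulted only when reflection checks fail. The sketch below illustrates that control flow under stated assumptions; all names here (i2cr_link, link_with_text, intra_modal_ok, inter_modal_ok, extract_visual_clue) are hypothetical placeholders standing in for the paper's LLM-backed components, not the API of the released code.

```python
from typing import Callable

def i2cr_link(
    mention: str,
    context: str,
    image: object,
    candidates: list[str],
    link_with_text: Callable[[str, str, list[str]], str],
    intra_modal_ok: Callable[[str, str, str], bool],
    inter_modal_ok: Callable[[str, object], bool],
    extract_visual_clue: Callable[[object, int], str],
    max_rounds: int = 3,
) -> str:
    """Hypothetical sketch of the text-first I2CR linking loop.

    Step 1: link using the text modality only.
    Step 2: keep the result if it passes the intra-modal (text-side) and
            inter-modal (text-image) reflection checks.
    Step 3: otherwise, iteratively extract a visual clue from a new aspect
            of the image and re-link, up to ``max_rounds`` rounds.
    """
    # Text-only attempt: the image is not consulted unless reflection fails.
    prediction = link_with_text(mention, context, candidates)
    if intra_modal_ok(mention, context, prediction) and inter_modal_ok(prediction, image):
        return prediction

    enriched_context = context
    for round_idx in range(max_rounds):
        # Pull a clue from a different visual aspect each round and fold it
        # into the textual context before re-linking.
        clue = extract_visual_clue(image, round_idx)
        enriched_context = f"{enriched_context}\nVisual clue: {clue}"
        prediction = link_with_text(mention, enriched_context, candidates)
        if intra_modal_ok(mention, enriched_context, prediction) and inter_modal_ok(prediction, image):
            break
    return prediction
```

In this reading, the reflection checks act as a gate: the cheaper text-only path is accepted whenever it is self-consistent and consistent with the image, and the multi-round visual-clue loop is only entered when that gate rejects the prediction.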