I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking

Abstract

Multimodal entity linking plays a crucial role in a wide range of applications. Methods based on large language models (LLMs) have recently become the dominant paradigm for this task, leveraging both textual and visual modalities to improve performance. Despite their success, these methods still face two challenges: they incorporate image data even in scenarios where it is unnecessary, and they rely on a one-time extraction of visual features. Both can undermine effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for multimodal entity linking, called Intra- and Inter-modal Collaborative Reflections (I2CR). The framework prioritizes text information. When intra- and inter-modality evaluations indicate that text alone is insufficient to link the correct entity, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and improve matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
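The text-first, multi-round control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `link_entity`, the callables `select`, `intra_ok`, and `inter_ok`, and the way visual clues are appended to the context are all hypothetical placeholders chosen for illustration. The actual prompts, evaluation criteria, and clue extraction are defined in the paper and its repository.

```python
from typing import Callable, Sequence


def link_entity(
    mention_text: str,
    candidates: Sequence[str],
    select: Callable[[str, Sequence[str]], str],
    intra_ok: Callable[[str, str], bool],
    inter_ok: Callable[[str], bool],
    visual_clues: Sequence[str],
) -> str:
    """Text-first linking that adds one visual clue per extra round.

    All callables are hypothetical stand-ins (assumptions, not the paper's API):
      select(context, candidates) -> chosen candidate, e.g. an LLM call
      intra_ok(context, entity)   -> text-side consistency check
      inter_ok(entity)            -> image-side consistency check
    """
    context = mention_text
    entity = ""
    # Round 0 uses text only; each later round has one more visual clue.
    for round_idx in range(len(visual_clues) + 1):
        entity = select(context, candidates)
        # Stop as soon as both evaluations accept the current choice.
        if intra_ok(context, entity) and inter_ok(entity):
            return entity
        # Otherwise enrich the context with the next visual clue, if any.
        if round_idx < len(visual_clues):
            context = f"{context}\n[visual clue] {visual_clues[round_idx]}"
    # Assumption: when all clues are exhausted, fall back to the last choice.
    return entity
```

In this sketch the image is consulted only after the text-only choice fails an evaluation, and clues are added incrementally rather than extracted once, which mirrors the two challenges the abstract names.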
