I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking

Abstract

Multimodal entity linking plays a crucial role in a wide range of applications. Large language model (LLM)-based methods have recently become the dominant paradigm for this task, leveraging both textual and visual modalities to improve performance. Despite their success, these methods still face two challenges: they incorporate image data even in scenarios where it is unnecessary, and they rely on a one-time extraction of visual features. Both can undermine effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections (I2CR). The framework prioritizes text information when addressing the task. When intra- and inter-modal evaluations indicate that text alone is insufficient to link the correct entity, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and improve matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
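As a rough illustration of the text-first control flow described in the abstract, the sketch below tries a text-only link first and consults the image only when a check fails, adding one visual clue per round. It is not the authors' implementation: the function `link_with_reflection`, its three callables (`select_entity`, `passes_checks`, `visual_clue`), the `max_rounds` parameter, and the toy data are all hypothetical names assumed here for illustration.

```python
from typing import Callable, Optional


def link_with_reflection(
    mention_text: str,
    image: object,
    select_entity: Callable[[str], str],             # hypothetical: picks an entity from the context
    passes_checks: Callable[[str, str, object], bool],  # hypothetical: stands in for the evaluations
    visual_clue: Callable[[object, int], Optional[str]],  # hypothetical: one image aspect per round
    max_rounds: int = 3,
) -> str:
    """Text-first linking with optional rounds of visual clues (illustrative only)."""
    context = mention_text
    # The first attempt uses text only; the image is untouched so far.
    entity = select_entity(context)
    for round_idx in range(max_rounds):
        if passes_checks(entity, context, image):
            return entity
        clue = visual_clue(image, round_idx)
        if clue is None:
            break  # no further visual aspects to draw on
        # Fold the new clue into the context and try again.
        context = f"{context} [visual clue: {clue}]"
        entity = select_entity(context)
    return entity


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_select(context: str) -> str:
    return "Apple Inc." if "logo" in context else "apple (fruit)"


def toy_checks(entity: str, context: str, image: object) -> bool:
    return entity == "Apple Inc."


def toy_clue(image: object, round_idx: int) -> Optional[str]:
    clues = ["a person on stage", "a bitten-apple logo"]
    return clues[round_idx] if round_idx < len(clues) else None


result = link_with_reflection(
    "Apple unveiled a new product", None, toy_select, toy_checks, toy_clue
)
```

In this toy run the text-only guess fails the check, the first clue does not help, and the second clue changes the selected entity. When the text-only guess already passes the check, `visual_clue` is never called, which mirrors the stated goal of avoiding unnecessary use of image data.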
