Question Aware Vision Transformer for Multimodal Reasoning

Abstract

Vision-Language (VL) models have become a major research focus, enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder, a Large Language Model (LLM), and a projection module that aligns visual features with the LLM's representation space. Despite their success, a critical limitation persists: the vision encoding process remains decoupled from the user query, which often takes the form of an image-related question. Consequently, the resulting visual features may not be optimally attuned to the query-specific elements of the image. To address this, we introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning, which embeds question awareness directly within the vision encoder. This integration yields dynamic visual features that focus on the image aspects relevant to the posed question. QA-ViT is model-agnostic and can be incorporated efficiently into any VL architecture. Extensive experiments demonstrate the effectiveness of applying our method to various multimodal architectures, yielding consistent improvements across diverse tasks and showing its potential for enhancing visual and scene-text understanding.
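To make the contrast between question-agnostic and question-aware visual features concrete, the following is a toy sketch only. It is not the QA-ViT mechanism, which this abstract does not specify. It assumes patch features and a question embedding that share a dimension, and the function name `question_aware_features` and the `gate` parameter are hypothetical names introduced for illustration. Each patch is rescaled by a softmax attention weight against the question, and `gate=0.0` recovers the decoupled, question-agnostic features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def question_aware_features(patches, question, gate=1.0):
    """Hypothetical illustration: condition patch features on a question.

    patches:  list of patch feature vectors (question-agnostic encoder output)
    question: question embedding with the same dimension as each patch
    gate:     0.0 leaves the features untouched; larger values emphasise
              patches that align with the question
    Returns (conditioned_features, attention_weights).
    """
    dim = len(question)
    # Scaled dot-product score of every patch against the question.
    scores = [dot(p, question) / math.sqrt(dim) for p in patches]
    weights = softmax(scores)
    # Residual-style rescaling: original feature plus a gated, weighted copy.
    conditioned = [
        [value * (1.0 + gate * w) for value in patch]
        for patch, w in zip(patches, weights)
    ]
    return conditioned, weights


# Usage: the question embedding points along the second feature axis,
# so patches with energy on that axis receive larger weights.
example_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
example_question = [0.0, 2.0]
features, weights = question_aware_features(example_patches, example_question)
```

In this sketch the same image yields different features for different questions, which is the property the abstract attributes to a question-aware encoder. How QA-ViT achieves this inside the vision encoder is not described here.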

