

Question Aware Vision Transformer for Multimodal Reasoning

February 8, 2024
Authors: Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman
cs.AI

Abstract

Vision-Language (VL) models have gained significant research focus, enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder, a Large Language Model (LLM), and a projection module that aligns visual features with the LLM's representation space. Despite their success, a critical limitation persists: the vision encoding process remains decoupled from the user query, which often takes the form of an image-related question. Consequently, the resulting visual features may not be optimally attuned to the query-specific elements of the image. To address this, we introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning, which embeds question awareness directly within the vision encoder. This integration yields dynamic visual features that focus on the image aspects relevant to the posed question. QA-ViT is model-agnostic and can be incorporated efficiently into any VL architecture. Extensive experiments demonstrate the effectiveness of applying our method to various multimodal architectures, leading to consistent improvements across diverse tasks and showcasing its potential for enhancing visual and scene-text understanding.
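
To make the idea of question-aware visual encoding concrete, below is a minimal PyTorch sketch of one way such conditioning could be wired into a ViT block: the question's token embeddings are projected into the visual width, and the patch tokens cross-attend to them through a zero-initialized gate. The class name, the gated cross-attention fusion, and all dimensions here are illustrative assumptions for exposition, not the authors' exact QA-ViT implementation.

```python
# Minimal sketch (PyTorch) of injecting question awareness into a ViT block.
# The module names and the gated cross-attention fusion are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn


class QuestionAwareViTBlock(nn.Module):
    """A standard pre-norm ViT block extended with cross-attention to question tokens."""

    def __init__(self, dim: int, text_dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        # Standard ViT components: self-attention over patch tokens + MLP.
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # Question-awareness components (hypothetical): project question embeddings
        # into the visual width and let patch tokens cross-attend to them.
        self.text_proj = nn.Linear(text_dim, dim)
        self.norm_q = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate so the block starts out behaving like a plain ViT block.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, patches: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # patches:  (B, N_patches, dim)   visual tokens
        # question: (B, N_text, text_dim) token embeddings of the user query
        x = patches
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        q = self.text_proj(question)
        # Patch tokens attend to the question; the gate keeps early training stable.
        x = x + self.gate * self.cross_attn(self.norm_q(x), q, q)[0]
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    block = QuestionAwareViTBlock(dim=768, text_dim=512)
    patches = torch.randn(2, 196, 768)   # e.g., 14x14 patches from a 224px image
    question = torch.randn(2, 12, 512)   # e.g., 12 question tokens
    out = block(patches, question)
    print(out.shape)  # torch.Size([2, 196, 768])
```

In a sketch like this, the zero-initialized gate means the modified encoder reproduces the original vision encoder's features at initialization, and question conditioning is learned gradually; this is one common way to add a new conditioning path to a pretrained backbone without destabilizing it.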