Knowledge-based Visual Question Answering with Multimodal Processing, Retrieval and Filtering
October 16, 2025
作者: Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye
cs.AI
Abstract
Knowledge-based visual question answering (KB-VQA) requires vision-language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances on this task by incorporating knowledge-base queries, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, comprising Processing, Retrieval, and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to perform multimodal knowledge retrieval. The filtering stage screens the retrieved results for relevance and concentrates them on content pertinent to the question. To this end, we introduce a vision-language model trained via reinforcement learning, with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, its tool invocation for precise queries, and its filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements in answer quality (36.0 and 42.8, respectively), achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF.
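
The abstract only names the three stages, so the following minimal Python sketch fixes the control flow they imply. It is an illustrative assumption, not the released Wiki-PRF implementation (see the repository above): the word-overlap retriever and keyword filter are toy stand-ins for the fused visual-text retriever and the RL-trained VLM filter, and the 0.5 reward weight is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str


def process(question: str, visual_clues: list[str]) -> str:
    # Processing stage: fold visual-tool outputs (captions, OCR,
    # grounding results) into one precise text query for retrieval.
    return " ".join([question, *visual_clues])


def retrieve(query: str, corpus: list[Doc], k: int = 3) -> list[Doc]:
    # Retrieval stage: toy word-overlap scoring stands in for the fused
    # visual+text embedding search the abstract describes.
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q & set(d.text.lower().split())))
    return ranked[:k]


def filter_docs(question: str, docs: list[Doc]) -> list[Doc]:
    # Filtering stage: the paper uses an RL-trained VLM to judge
    # relevance; a keyword-overlap check stands in for it here.
    key = set(question.lower().split())
    return [d for d in docs if key & set(d.text.lower().split())]


def reward(pred: str, gold: str, well_formatted: bool) -> float:
    # Reward signal named in the abstract: answer accuracy plus format
    # consistency. The 0.5 weight is a hypothetical choice.
    accuracy = float(pred.strip().lower() == gold.strip().lower())
    return accuracy + 0.5 * float(well_formatted)


if __name__ == "__main__":
    corpus = [
        Doc("The Eiffel Tower is in Paris, France."),
        Doc("Bananas are rich in potassium."),
        Doc("Paris hosted the 1900 World's Fair."),
    ]
    question = "Which city is this tower in?"
    query = process(question, ["caption: the Eiffel Tower at night"])
    kept = filter_docs(question, retrieve(query, corpus))
    print([d.text for d in kept])          # only the Eiffel Tower entry survives
    print(reward("paris", "Paris", True))  # 1.5
```

In the paper's actual pipeline, a single RL-trained VLM handles the reasoning, decides which visual tools to invoke, and performs the relevance filtering; the separate stub functions here exist only to make the three stages explicit.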