Knowledge-based Visual Question Answering with Multimodal Processing, Retrieval and Filtering
October 16, 2025
Authors: Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye
cs.AI
Abstract
Knowledge-based visual question answering (KB-VQA) requires vision-language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) has achieved significant advances on this task by incorporating knowledge-base queries, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, comprising Processing, Retrieval and Filtering stages. The Processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The Retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The Filtering stage performs relevance filtering and condensation on the retrieved results. To this end, we introduce a vision-language model trained via reinforcement learning, with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, its tool invocation for accurate queries, and its filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements in answer quality (36.0 and 42.8), achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF.
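
To make the three-stage pipeline and the reward design concrete, the Python sketch below outlines one possible realization of what the abstract describes. Every name in it (wiki_prf_answer, select_tools, compose_queries, fuse_embed, is_relevant, rl_reward, the 0.5 weight) is an illustrative assumption, not the actual Wiki-PRF API; see the repository above for the authors' implementation.

```python
# Hypothetical sketch of the Wiki-PRF Processing/Retrieval/Filtering loop.
# None of these names come from the actual codebase; they stand in for the
# stages described in the abstract.

from typing import Callable, List


def wiki_prf_answer(
    image,                        # raw query image
    question: str,                # natural-language question
    vlm,                          # RL-trained VLM exposing the methods below (assumed)
    search_kb: Callable,          # knowledge-base search: query embedding -> passages
    visual_tools: List[Callable]  # e.g. captioning, OCR, grounding tools (assumed)
) -> str:
    # Stage 1 -- Processing: the VLM decides which visual tools to invoke
    # and composes precise multimodal queries from their outputs.
    tool_outputs = [tool(image) for tool in vlm.select_tools(question, visual_tools)]
    queries = vlm.compose_queries(question, tool_outputs)

    # Stage 2 -- Retrieval: visual and text features are fused into a single
    # query embedding so that knowledge-base search is genuinely multimodal.
    candidates = []
    for q in queries:
        candidates.extend(search_kb(vlm.fuse_embed(image, q)))

    # Stage 3 -- Filtering: keep only passages the VLM judges relevant,
    # condensing the evidence before answer generation.
    evidence = [p for p in candidates if vlm.is_relevant(question, p)]

    return vlm.generate(image, question, context=evidence)


def rl_reward(pred: str, gold: str, response: str) -> float:
    """Reward combining answer accuracy and format consistency, as the
    abstract describes; the exact formulation and weight are assumptions."""
    accuracy = float(pred.strip().lower() == gold.strip().lower())
    well_formed = float("<answer>" in response and "</answer>" in response)
    return accuracy + 0.5 * well_formed
```

The design point this sketch illustrates is that a single policy (the VLM) is rewarded end to end, so tool selection, query composition, and relevance filtering all improve from the same answer-level reward signal rather than from separately supervised components.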