Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Abstract

Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Retrieval-augmented generation (RAG) has brought significant advances to this task by incorporating knowledge-base querying, but it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, consisting of Processing, Retrieval, and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on the retrieval results. To this end, we introduce a visual language model trained via reinforcement learning, with answer accuracy and format consistency as reward signals. This training enhances the model's reasoning, its tool invocation for accurate queries, and its filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
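
The reward design mentioned in the abstract (answer accuracy plus format consistency) can be illustrated with a minimal sketch. This is not the Wiki-PRF implementation: the `<think>`/`<answer>` tag format, the exact-match accuracy criterion, the function names, and the `format_weight` value are all assumptions made here for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical response format (an assumption, not taken from the paper):
# reasoning inside <think>...</think>, final answer inside <answer>...</answer>.
RESPONSE_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def normalize(text: str) -> str:
    """Case-fold and replace runs of punctuation/whitespace with one space."""
    return re.sub(r"[\W_]+", " ", text.casefold()).strip()


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed tag format, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response) else 0.0


def accuracy_reward(response: str, gold_answers: list[str]) -> float:
    """1.0 if the extracted answer matches any gold answer after normalization."""
    match = RESPONSE_PATTERN.match(response)
    if match is None:
        return 0.0
    predicted = normalize(match.group(1))
    if not predicted:
        return 0.0
    return 1.0 if any(predicted == normalize(g) for g in gold_answers) else 0.0


def total_reward(
    response: str, gold_answers: list[str], format_weight: float = 0.5
) -> float:
    """Combine both signals; the weighting is an illustrative assumption."""
    return accuracy_reward(response, gold_answers) + format_weight * format_reward(
        response
    )


if __name__ == "__main__":
    good = "<think>The landmark is in Paris.</think><answer>Eiffel Tower</answer>"
    wrong = "<think>Looks like London.</think><answer>Big Ben</answer>"
    print(total_reward(good, ["eiffel tower"]))
    print(total_reward(wrong, ["eiffel tower"]))
    print(total_reward("Eiffel Tower", ["eiffel tower"]))
```

In this sketch a well-formed but wrong response still earns the format term, while an untagged response earns nothing even if it contains the right string. That is one way a format-consistency signal could push a policy toward parseable outputs; the actual reward used by the authors may differ.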
