ProMSA: Knowledge-Based Visual Question Answering with a Progressive Multimodal Search Agent

Abstract

Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, so they cannot adapt retrieval during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively chooses among image search, text search, and stopping, subject to explicit tool-call budgets, and deduplicates its searches to avoid redundant retrieval. For training, we first apply supervised fine-tuning (SFT) with rejection sampling so that the agent learns valid tool-use formats, then optimize it with TN-GSPO, a sequence-level reinforcement learning objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines in both retrieval accuracy and end-to-end accuracy. The code is available at https://github.com/DingWu1021/Promsa.
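
The search loop described in the abstract (choose image search, text search, or stop, within a tool-call budget and with deduplication) can be illustrated with a minimal Python sketch. This is not the released ProMSA implementation: the names `SearchState`, `run_agent`, `policy`, and `tools` are hypothetical, a single overall budget `max_calls` stands in for the paper's tool-call budgets, and counting a repeated call against the budget is an assumption made here so that the loop always terminates.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical action format: (tool name, query).
# Tool name is "image_search", "text_search", or "stop".
Action = Tuple[str, str]


@dataclass
class SearchState:
    evidence: List[str] = field(default_factory=list)    # unique retrieved passages
    seen_calls: Set[Action] = field(default_factory=set)  # (tool, query) pairs already issued
    seen_docs: Set[str] = field(default_factory=set)      # passages already collected
    calls_used: int = 0                                   # budget consumed so far


def run_agent(
    policy: Callable[[SearchState], Action],
    tools: Dict[str, Callable[[str], List[str]]],
    max_calls: int = 3,
) -> SearchState:
    """Iterate until the policy stops or the tool-call budget is spent."""
    state = SearchState()
    while state.calls_used < max_calls:
        tool, query = policy(state)
        if tool == "stop":
            break
        # Assumption: every attempted call consumes budget, including repeats.
        state.calls_used += 1
        if (tool, query) in state.seen_calls:
            continue  # deduplication: skip a call that was already issued
        state.seen_calls.add((tool, query))
        for doc in tools[tool](query):
            if doc not in state.seen_docs:  # deduplication at the passage level
                state.seen_docs.add(doc)
                state.evidence.append(doc)
    return state
```

In this sketch, `policy` stands in for the trained agent and `tools` for the image and text retrievers; the collected `evidence` would then be passed to the answer generator.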