VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

Abstract

Effectively retrieving, reasoning over, and understanding visually rich information remains a challenge for Retrieval-Augmented Generation (RAG) methods. Traditional text-based methods cannot handle visual information. Current vision-based RAG approaches, meanwhile, are often limited by fixed pipelines and frequently struggle to reason effectively because the fundamental capabilities of the models are insufficiently activated. As reinforcement learning (RL) has been shown to benefit model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, vision-language models (VLMs) interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) prior multimodal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) when models interact with search engines, their queries often fail to retrieve relevant information because they cannot articulate requirements clearly, leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.
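The coarse-to-fine perception actions mentioned in the abstract (cropping and scaling) can be illustrated with a minimal sketch. This is not the VRAG-RL implementation: the function name `crop_and_scale`, the normalized `(x0, y0, x1, y1)` box format, the nested-list image representation, and the nearest-neighbor upscaling are all assumptions chosen to keep the example dependency-free.

```python
from typing import List, Tuple

# Hypothetical image type: a grayscale page as rows of pixel values.
Image = List[List[int]]


def crop_and_scale(image: Image,
                   box: Tuple[float, float, float, float],
                   scale: int = 2) -> Image:
    """Crop a region given in normalized (x0, y0, x1, y1) coordinates,
    then enlarge it by an integer factor.

    Nearest-neighbor upscaling is an illustrative assumption, not the
    method used by the paper.
    """
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError("box must satisfy 0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1")

    # Convert normalized coordinates to pixel indices, keeping at least one pixel.
    left, top = int(x0 * width), int(y0 * height)
    right = max(left + 1, int(round(x1 * width)))
    bottom = max(top + 1, int(round(y1 * height)))
    cropped = [row[left:right] for row in image[top:bottom]]

    # Repeat each pixel `scale` times horizontally and each row `scale` times vertically.
    zoomed: Image = []
    for row in cropped:
        wide = [pixel for pixel in row for _ in range(scale)]
        zoomed.extend(list(wide) for _ in range(scale))
    return zoomed


# A 4x4 "page" whose pixel value encodes its position (row * 4 + column).
page = [[r * 4 + c for c in range(4)] for r in range(4)]

# Zoom into the bottom-right quadrant at 2x magnification.
region = crop_and_scale(page, (0.5, 0.5, 1.0, 1.0), scale=2)
```

In an agent loop of the kind the abstract describes, the model would emit the box as part of its action, and the enlarged region would be appended to the context as a new observation for the next reasoning turn. How the box is encoded in the model's output is not specified in the abstract, so that part is left out here.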
