UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Abstract

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
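To make the GRPO step concrete, the sketch below shows how several per-action rewards could be combined into one scalar per trajectory and then normalized within a sampled group, which is what lets GRPO do without a separate value network. This is only an illustration under stated assumptions: the reward component names (`retrieval`, `selection`, `crop`, `answer`), the weights, and the use of population standard deviation are hypothetical and are not specified in the abstract.

```python
from statistics import mean, pstdev


def combine_rewards(components, weights):
    """Weighted sum of reward components for one trajectory.

    Component names and weights are hypothetical placeholders.
    """
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style).

    Each trajectory's advantage is its reward minus the group mean,
    divided by the group standard deviation. No learned value
    function is involved; the group itself serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights over the hierarchical actions plus the final answer.
WEIGHTS = {"retrieval": 0.2, "selection": 0.2, "crop": 0.2, "answer": 0.4}

# A group of trajectories sampled for the same query (made-up scores).
group = [
    {"retrieval": 1.0, "selection": 1.0, "crop": 1.0, "answer": 1.0},
    {"retrieval": 1.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
]

rewards = [combine_rewards(c, WEIGHTS) for c in group]
advantages = group_relative_advantages(rewards)
```

Trajectories that score above the group average receive positive advantages and those below receive negative ones; when every trajectory in a group scores the same, all advantages are zero and that group contributes no learning signal.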
