VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
May 28, 2025
Authors: Qiuchen Wang, Ruixue Ding, Yu Zeng, Zehui Chen, Lin Chen, Shihang Wang, Pengjun Xie, Fei Huang, Feng Zhao
cs.AI
Abstract
Effectively retrieving, reasoning over, and understanding visually rich information
remains a challenge for RAG methods. Traditional text-based methods cannot
handle vision-related information. On the other hand, current vision-based RAG
approaches are often limited by fixed pipelines and frequently struggle to
reason effectively because they insufficiently activate the models' fundamental
capabilities. Since RL has proven beneficial for model
reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex
reasoning across visually rich information. With this framework, VLMs interact
with search engines, autonomously sampling single-turn or multi-turn reasoning
trajectories with the help of visual perception tokens and undergoing continual
optimization based on these samples. Our approach highlights key limitations of
RL in RAG domains: (i) prior multi-modal RAG approaches tend to merely
incorporate images into the context, leading to insufficient reasoning-token
allocation and neglecting visual-specific perception; and (ii) when models
interact with search engines, their queries often fail to retrieve relevant
information because they cannot articulate their information needs, leading to
suboptimal performance. To address these challenges, we define an action space
tailored for visually rich inputs, with actions including cropping and scaling,
allowing the model to gather information from a coarse-to-fine perspective.
Furthermore, to bridge the gap between users' original inquiries and the
retriever, we employ a simple yet effective reward that integrates query
rewriting and retrieval performance with a model-based reward. Our VRAG-RL
optimizes VLMs for RAG tasks using specially designed RL strategies, aligning
the model with real-world applications. The code is available at
https://github.com/Alibaba-NLP/VRAG.
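
The abstract describes two mechanisms only in prose: a visual action space (search with rewritten queries, crop/zoom for coarse-to-fine perception, answer) sampled over single- or multi-turn trajectories, and a reward that combines retrieval performance with a model-based score. The minimal sketch below illustrates how such a rollout and composite reward could be wired together; every name in it (`run_episode`, `composite_reward`, the 0.5 weighting, and so on) is a hypothetical placeholder for illustration and is not taken from the released VRAG-RL code.

```python
# Illustrative sketch only: names and signatures here are hypothetical and
# mirror the mechanisms described in the abstract, not the actual codebase.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str                      # "search" | "crop" | "answer"
    query: str = ""                # rewritten query used for "search"
    box: Tuple[float, float, float, float] = (0.0, 0.0, 1.0, 1.0)  # normalized crop region
    answer: str = ""               # final answer text for "answer"


@dataclass
class Trajectory:
    actions: List[Action] = field(default_factory=list)
    retrieved_pages: List[str] = field(default_factory=list)  # retrieved page/image ids


def run_episode(policy: Callable[[str, Trajectory], Action],
                search: Callable[[str], List[str]],
                question: str,
                max_turns: int = 6) -> Trajectory:
    """Roll out one multi-turn trajectory: the VLM policy interleaves search
    actions (with rewritten queries), coarse-to-fine crop/zoom actions on the
    retrieved page images, and finally an answer action."""
    traj = Trajectory()
    for _ in range(max_turns):
        action = policy(question, traj)
        traj.actions.append(action)
        if action.kind == "search":
            traj.retrieved_pages.extend(search(action.query))
        elif action.kind == "crop":
            # In a real system this would re-render the selected region at a
            # higher resolution and append it to the model's visual context.
            pass
        elif action.kind == "answer":
            break
    return traj


def composite_reward(traj: Trajectory,
                     gold_pages: List[str],
                     judge: Callable[[str], float],
                     w_retrieval: float = 0.5) -> float:
    """Hypothetical composite reward: recall of gold evidence pages plus a
    model-based score of the final answer, in the spirit of the reward the
    abstract describes; the exact formulation in VRAG-RL may differ."""
    recall = (sum(p in traj.retrieved_pages for p in gold_pages) / len(gold_pages)
              if gold_pages else 0.0)
    final_answer = next((a.answer for a in reversed(traj.actions)
                         if a.kind == "answer"), "")
    answer_score = judge(final_answer)  # e.g., an LLM-as-judge score in [0, 1]
    return w_retrieval * recall + (1.0 - w_retrieval) * answer_score
```

A policy-gradient-style optimizer would then update the VLM on batches of such scored trajectories; the abstract does not specify which RL algorithm VRAG-RL uses, so that step is omitted here.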