VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
May 28, 2025
Authors: Qiuchen Wang, Ruixue Ding, Yu Zeng, Zehui Chen, Lin Chen, Shihang Wang, Pengjun Xie, Fei Huang, Feng Zhao
cs.AI
Abstract
Effectively retrieving, reasoning and understanding visually rich information
remains a challenge for RAG methods. Traditional text-based methods cannot
handle visual-related information. On the other hand, current vision-based RAG
approaches are often limited by fixed pipelines and frequently struggle to
reason effectively due to the insufficient activation of the fundamental
capabilities of models. As RL has been proven to be beneficial for model
reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex
reasoning across visually rich information. With this framework, VLMs interact
with search engines, autonomously sampling single-turn or multi-turn reasoning
trajectories with the help of visual perception tokens and undergoing continual
optimization based on these samples. Our approach highlights key limitations of
RL in RAG domains: (i) Prior Multi-modal RAG approaches tend to merely
incorporate images into the context, leading to insufficient reasoning token
allocation and neglecting visual-specific perception; and (ii) When models
interact with search engines, their queries often fail to retrieve relevant
information due to the inability to articulate requirements, thereby leading to
suboptimal performance. To address these challenges, we define an action space
tailored for visually rich inputs, with actions including cropping and scaling,
allowing the model to gather information from a coarse-to-fine perspective.
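To make this coarse-to-fine idea concrete, here is a minimal sketch of how a crop-and-scale perception action space could be represented. The names (`CropAction`, `ScaleAction`, `apply_action`) and the fractional-coordinate convention are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative sketch only: the paper states the action space includes cropping
# and scaling; the concrete classes and coordinate conventions here are assumed.
from dataclasses import dataclass
from PIL import Image


@dataclass
class CropAction:
    """Select a sub-region of the current page image (coarse -> fine)."""
    left: float    # fractional coordinates in [0, 1]
    top: float
    right: float
    bottom: float


@dataclass
class ScaleAction:
    """Re-render the current view at a different resolution."""
    factor: float  # e.g. 2.0 doubles the width and height of the current view


def apply_action(view: Image.Image, action) -> Image.Image:
    """Apply a perception action and return the new, more focused view."""
    w, h = view.size
    if isinstance(action, CropAction):
        box = (int(action.left * w), int(action.top * h),
               int(action.right * w), int(action.bottom * h))
        return view.crop(box)
    if isinstance(action, ScaleAction):
        return view.resize((int(w * action.factor), int(h * action.factor)))
    raise ValueError(f"Unsupported action: {action!r}")
```

Cropping a region and then rescaling it mirrors the coarse-to-fine information gathering described above: the model first localizes the relevant part of a visually rich page, then inspects it at higher resolution.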
Furthermore, to bridge the gap between users' original inquiries and the
retriever, we employ a simple yet effective reward that integrates query
rewriting and retrieval performance with a model-based reward. Our VRAG-RL
optimizes VLMs for RAG tasks using specially designed RL strategies, aligning
the model with real-world applications. The code is available at
https://github.com/Alibaba-NLP/VRAG.
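As an illustration of how retrieval performance and a model-based answer score could be combined into a single scalar reward, a hedged sketch follows. The recall-style retrieval metric, the `answer_scorer` judge, and the mixing weight `alpha` are assumptions for exposition; the abstract only states that query rewriting and retrieval performance are integrated with a model-based reward.

```python
# Illustrative sketch: blend retrieval quality (how well the rewritten query
# recovers gold evidence pages) with a model-based answer score. The specific
# metric and weighting are assumptions, not the paper's exact formulation.
from typing import Callable, List


def retrieval_reward(retrieved_ids: List[str], gold_ids: List[str]) -> float:
    """Fraction of gold evidence pages recovered by the (rewritten) query."""
    if not gold_ids:
        return 0.0
    hits = sum(1 for doc_id in gold_ids if doc_id in retrieved_ids)
    return hits / len(gold_ids)


def composite_reward(
    retrieved_ids: List[str],
    gold_ids: List[str],
    prediction: str,
    reference: str,
    answer_scorer: Callable[[str, str], float],  # e.g. an LLM/VLM judge returning [0, 1]
    alpha: float = 0.5,  # hypothetical mixing weight
) -> float:
    """Blend retrieval performance with a model-based answer reward."""
    r_retrieval = retrieval_reward(retrieved_ids, gold_ids)
    r_answer = answer_scorer(prediction, reference)
    return alpha * r_retrieval + (1.0 - alpha) * r_answer
```

Rewarding retrieval quality alongside the final answer score encourages the policy to issue queries the retriever can act on, rather than optimizing the answer alone, which is the gap between users' original inquiries and the retriever that the abstract describes.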