Context-Aware Reinforcement Learning for Agentic and Multimodal Large Language Models

Abstract

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains. For coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering. For multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks and +1.8% on 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of the additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, indicating that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
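
The context-selection objective and the data-augmentation baseline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ContrastivePair`, `build_selection_prompt`, `selection_reward`, `to_standard_example`, and `group_relative_advantages` are hypothetical, the prompt wording and the binary 0/1 reward are assumptions, and the advantage computation follows the commonly used GRPO form (reward minus group mean, divided by group standard deviation), which may differ from the exact variant used in this work.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class ContrastivePair:
    """One contrastive example (hypothetical schema, not from the paper)."""
    query: str
    answer: str
    supporting_context: str  # context consistent with the (query, answer) pair
    distractor_context: str  # highly similar context that does not support it


def build_selection_prompt(pair, rng):
    """Shuffle the two contexts and return (prompt, label of the supporting one)."""
    contexts = [("supporting", pair.supporting_context),
                ("distractor", pair.distractor_context)]
    rng.shuffle(contexts)  # randomize order so position carries no signal
    labels = ["A", "B"]
    correct = labels[[kind for kind, _ in contexts].index("supporting")]
    prompt = (
        f"Query: {pair.query}\n"
        f"Answer: {pair.answer}\n"
        f"Context A:\n{contexts[0][1]}\n"
        f"Context B:\n{contexts[1][1]}\n"
        "Which context supports the query-answer pair? "
        "Reply with A or B on the final line."
    )
    return prompt, correct


def selection_reward(response, correct_label):
    """Assumed binary reward: 1.0 if the last non-empty line is the correct label."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 if lines[-1].upper() == correct_label else 0.0


def to_standard_example(pair):
    """Data-augmentation baseline (assumed form): a plain query-context-answer example."""
    return {"query": pair.query,
            "context": pair.supporting_context,
            "answer": pair.answer}


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # no learning signal if all rewards are equal
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    pair = ContrastivePair(
        query="Which test failed?",
        answer="test_parse",
        supporting_context="... FAILED test_parse ...",
        distractor_context="... FAILED test_render ...",
    )
    prompt, correct = build_selection_prompt(pair, random.Random(0))
    sampled = ["I compare both traces.\n" + correct, "Unsure.\nC"]
    rewards = [selection_reward(r, correct) for r in sampled]
    print(rewards, group_relative_advantages(rewards))
```

In this sketch the model is rewarded only for picking the right context, never for producing the answer, which is what makes the objective indirect. The baseline function uses the same data but in the ordinary supervised form, so comparing the two isolates the effect of the objective from the effect of the extra data.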