Context-Aware Reinforcement Learning for Agentic and Multimodal LLMs

Abstract

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
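The context-selection objective described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ContrastivePair` type, the prompt wording, the `Choice: A/B` reply format, and the binary 0/1 reward are all assumptions made for illustration, and the contexts are plain strings here (a trajectory excerpt, or a stand-in for an image reference).

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical training item: a query-answer pair plus two similar contexts."""
    query: str
    answer: str
    supporting_context: str  # context that actually supports the answer
    distractor_context: str  # near-duplicate context that does not


def build_prompt(pair: ContrastivePair, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, correct_label); context order is shuffled to avoid position bias."""
    contexts = [
        ("supporting", pair.supporting_context),
        ("distractor", pair.distractor_context),
    ]
    rng.shuffle(contexts)
    correct_label = "A" if contexts[0][0] == "supporting" else "B"
    prompt = (
        f"Query: {pair.query}\n"
        f"Answer: {pair.answer}\n"
        f"Context A:\n{contexts[0][1]}\n"
        f"Context B:\n{contexts[1][1]}\n"
        "Which context supports the answer? Reply with 'Choice: A' or 'Choice: B'."
    )
    return prompt, correct_label


def selection_reward(response: str, correct_label: str) -> float:
    """Assumed binary reward: 1.0 if the last stated choice is correct, else 0.0."""
    matches = re.findall(r"Choice:\s*([AB])\b", response)
    if not matches:
        return 0.0  # no parseable choice earns no reward
    return 1.0 if matches[-1] == correct_label else 0.0
```

In an RL loop, `selection_reward` would score each sampled response to the prompt from `build_prompt`, and those scores would feed the policy update (for example, GRPO's group-relative advantages). How the selection reward is combined with the ordinary answer reward is not specified in the abstract and is left out of this sketch.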