Context-Aware Reinforcement Learning for Agentic and Multimodal LLMs

Abstract

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains. For coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering. For multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL improves over standard GRPO by an average of +2.2% on 5 long-horizon benchmarks and by an average of +1.8% on 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of the additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines yield little to no improvement, indicating that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
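
The context-selection objective described in the abstract can be illustrated with a minimal Python sketch. Everything named here is a hypothetical choice made for illustration and is not taken from the paper: the `ContrastivePair` container, the prompt wording, the "A"/"B" answer format, and the binary reward. The group-relative advantage follows the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation). How ContextRL actually formats prompts, scores responses, or combines this signal with the main task reward is not specified in the abstract.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical container for one contrastive example (names are assumptions)."""
    query: str
    answer: str
    supporting_context: str   # the context that supports the query-answer pair
    distractor_context: str   # a highly similar context that does not support it


def build_prompt(pair: ContrastivePair, rng: random.Random) -> tuple[str, str]:
    """Place the two contexts in random order and return (prompt, correct_label)."""
    supporting_first = rng.random() < 0.5
    if supporting_first:
        first, second = pair.supporting_context, pair.distractor_context
    else:
        first, second = pair.distractor_context, pair.supporting_context
    correct_label = "A" if supporting_first else "B"
    prompt = (
        f"Query: {pair.query}\n"
        f"Answer: {pair.answer}\n"
        f"Context A:\n{first}\n"
        f"Context B:\n{second}\n"
        "Which context supports the query-answer pair? "
        "End your response with a single line containing A or B."
    )
    return prompt, correct_label


def selection_reward(response: str, correct_label: str) -> float:
    """Binary reward (an assumption): 1.0 if the last non-empty line is the correct label."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return 1.0 if lines[-1].upper() == correct_label else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each reward by the group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    pair = ContrastivePair(
        query="Which test failed?",
        answer="test_parse_header",
        supporting_context="... FAILED test_parse_header ...",
        distractor_context="... FAILED test_parse_footer ...",
    )
    prompt, label = build_prompt(pair, random.Random(0))
    # Stand-in for a group of sampled model responses to the same prompt.
    responses = ["The trace mentions the header test.\nA", "B", "Unsure", "a"]
    rewards = [selection_reward(r, label) for r in responses]
    print(rewards, group_relative_advantages(rewards))
```

Randomizing which slot holds the supporting context is a design choice in this sketch, intended to keep a policy from earning reward through a positional shortcut. When every response in a group receives the same reward, the normalized advantages are all zero, so that group contributes no learning signal.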