ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

Abstract


Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, which integrates Group Relative Policy Optimization (GRPO) with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
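The two mechanisms named above, periodic summarization during rollout and advantage broadcasting over segmented trajectories, can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: `policy`, `run_tool`, and `summarize` are hypothetical callables standing in for the agent, the search tools, and the summarizer; the context budget is measured in characters rather than tokens; a final answer is marked by a hypothetical `ANSWER:` prefix; and the group normalization uses the population standard deviation plus a small `eps`, one common GRPO-style choice.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List, Optional


@dataclass
class Segment:
    """One training sample: the context the policy saw, plus its advantage."""
    messages: List[str]
    advantage: float = 0.0


@dataclass
class Trajectory:
    segments: List[Segment] = field(default_factory=list)
    answer: Optional[str] = None


def resum_rollout(
    query: str,
    policy: Callable[[List[str]], str],       # hypothetical agent
    run_tool: Callable[[str], str],           # hypothetical search/visit tool
    summarize: Callable[[List[str]], str],    # hypothetical summarizer
    budget: int,                              # assumption: characters, not tokens
    max_turns: int = 50,
) -> Trajectory:
    """ReAct-style loop that, whenever the history exceeds `budget`, closes the
    current segment and restarts from the compact state (query, summary)."""
    traj = Trajectory()
    history = [query]
    for _ in range(max_turns):
        action = policy(history)
        history.append(action)
        if action.startswith("ANSWER:"):      # hypothetical answer convention
            traj.answer = action[len("ANSWER:"):].strip()
            break
        history.append(run_tool(action))
        if sum(len(m) for m in history) > budget:
            traj.segments.append(Segment(messages=history))
            history = [query, summarize(history)]  # summary replaces full history
    traj.segments.append(Segment(messages=history))
    return traj


def broadcast_advantages(
    group: List[Trajectory], rewards: List[float], eps: float = 1e-6
) -> None:
    """Normalize one scalar reward per trajectory within its group, then copy
    that advantage onto every segment of the trajectory."""
    mu, sigma = mean(rewards), pstdev(rewards)
    for traj, reward in zip(group, rewards):
        advantage = (reward - mu) / (sigma + eps)
        for seg in traj.segments:
            seg.advantage = advantage


if __name__ == "__main__":
    def make_policy(final: str) -> Callable[[List[str]], str]:
        # Scripted toy policy: two searches, then a final answer.
        steps = iter(["search: a", "search: b", f"ANSWER: {final}"])
        return lambda history: next(steps)

    def run_tool(action: str) -> str:
        return "result for " + action * 10  # long observation forces summarization

    def summarize(history: List[str]) -> str:
        return f"summary of {len(history)} messages"

    group = [resum_rollout("Q?", make_policy(a), run_tool, summarize, budget=100)
             for a in ["x", "y"]]
    rewards = [1.0 if t.answer == "x" else 0.0 for t in group]
    broadcast_advantages(group, rewards)
    for t in group:
        print(t.answer, len(t.segments), [round(s.advantage, 3) for s in t.segments])
```

In this toy run each rollout is split into three segments at the two summarization points, and every segment of the correct trajectory receives the same positive advantage while every segment of the incorrect one receives the same negative advantage. Broadcasting a single trajectory-level advantage in this way lets each segment be trained as an ordinary GRPO sample even though only the last segment contains the answer; the policy-gradient update itself is omitted from the sketch.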