Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

Abstract

The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent's interaction horizon to enable running rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout. To demonstrate the promise of this scaling dimension, we study the domain of web agents. We first show that even prompting-based interaction scaling without any training can improve task success on web benchmarks non-trivially. Building on this, we introduce TTI (Test-Time Interaction), a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data web agents on WebVoyager and WebArena benchmarks. We further show that TTI enables agents to balance exploration and exploitation adaptively. Our results establish interaction scaling as a powerful, complementary axis to scaling per-step compute, offering new avenues for training adaptive agents.
