Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
June 9, 2025
Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
cs.AI
Abstract
The current paradigm of test-time scaling relies on generating long reasoning
traces ("thinking" more) before producing a response. In agent problems that
require interaction, this can be done by generating thinking traces before
acting in the world. However, this process does not allow agents to acquire new
information from the environment or adapt their behavior over time. In this
work, we propose to scale test-time interaction, an untapped dimension of
test-time scaling that increases the agent's interaction horizon to enable
running rich behaviors such as exploration, backtracking, and dynamic
re-planning within a single rollout. To demonstrate the promise of this scaling
dimension, we study the domain of web agents. We first show that even
prompting-based interaction scaling without any training can improve task
success on web benchmarks non-trivially. Building on this, we introduce TTI
(Test-Time Interaction), a curriculum-based online reinforcement learning (RL)
approach that trains agents by adaptively adjusting their rollout lengths.
Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data
web agents on WebVoyager and WebArena benchmarks. We further show that TTI
enables agents to balance exploration and exploitation adaptively. Our results
establish interaction scaling as a powerful, complementary axis to scaling
per-step compute, offering new avenues for training adaptive agents.
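The core idea, adaptively growing the agent's rollout length over the course of online RL, can be illustrated with a minimal sketch. This is not the paper's implementation; the schedule, horizon values, and the `agent`/`env` interface are all hypothetical placeholders chosen for illustration.

```python
# Minimal sketch (hypothetical, not the paper's code): a curriculum that
# gradually expands the interaction horizon during online RL training.

def horizon_schedule(step: int, h_min: int = 10, h_max: int = 30,
                     total_steps: int = 1000) -> int:
    """Linearly grow the allowed rollout length as training progresses."""
    frac = min(step / total_steps, 1.0)
    return int(h_min + frac * (h_max - h_min))

def collect_rollout(agent, env, max_horizon: int):
    """Roll out up to max_horizon environment steps. Longer horizons leave
    room for behaviors like exploration, backtracking, and re-planning
    within a single rollout; `agent` and `env` are placeholder objects."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_horizon):
        action = agent.act(obs)  # may internally generate a thinking trace
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory
```

The contrast with per-step ("thinking") scaling is visible in the loop: rather than spending more compute inside each `agent.act` call, the curriculum raises `max_horizon`, giving the agent more chances to gather information from the environment before the episode ends.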