Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
June 9, 2025
Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
cs.AI
Abstract
The current paradigm of test-time scaling relies on generating long reasoning
traces ("thinking" more) before producing a response. In agent problems that
require interaction, this can be done by generating thinking traces before
acting in the world. However, this process does not allow agents to acquire new
information from the environment or adapt their behavior over time. In this
work, we propose to scale test-time interaction, an untapped dimension of
test-time scaling that increases the agent's interaction horizon to enable
running rich behaviors such as exploration, backtracking, and dynamic
re-planning within a single rollout. To demonstrate the promise of this scaling
dimension, we study the domain of web agents. We first show that even
prompting-based interaction scaling without any training can improve task
success on web benchmarks non-trivially. Building on this, we introduce TTI
(Test-Time Interaction), a curriculum-based online reinforcement learning (RL)
approach that trains agents by adaptively adjusting their rollout lengths.
Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data
web agents on WebVoyager and WebArena benchmarks. We further show that TTI
enables agents to balance exploration and exploitation adaptively. Our results
establish interaction scaling as a powerful, complementary axis to scaling
per-step compute, offering new avenues for training adaptive agents.
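The core idea, adaptively growing the agent's rollout length over the course of online RL, can be illustrated with a minimal sketch. This is not the paper's implementation; the schedule, horizon values, and the `agent`/`env` interface are all hypothetical placeholders chosen for illustration.

```python
# Minimal sketch (hypothetical, not the paper's code): a curriculum that
# gradually expands the interaction horizon during online RL training.

def horizon_schedule(step: int, h_min: int = 10, h_max: int = 30,
                     total_steps: int = 1000) -> int:
    """Linearly grow the allowed rollout length as training progresses."""
    frac = min(step / total_steps, 1.0)
    return int(h_min + frac * (h_max - h_min))

def collect_rollout(agent, env, max_horizon: int):
    """Roll out up to max_horizon environment steps. Longer horizons leave
    room for behaviors like exploration, backtracking, and re-planning
    within a single rollout; `agent` and `env` are placeholder objects."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_horizon):
        action = agent.act(obs)  # may internally generate a thinking trace
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory
```

The contrast with per-step ("thinking") scaling is visible in the loop: rather than spending more compute inside each `agent.act` call, the curriculum raises `max_horizon`, giving the agent more chances to gather information from the environment before the episode ends.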