Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
June 9, 2025
Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
cs.AI
Abstract
The current paradigm of test-time scaling relies on generating long reasoning
traces ("thinking" more) before producing a response. In agent problems that
require interaction, this can be done by generating thinking traces before
acting in the world. However, this process does not allow agents to acquire new
information from the environment or adapt their behavior over time. In this
work, we propose to scale test-time interaction, an untapped dimension of
test-time scaling that increases the agent's interaction horizon to enable
running rich behaviors such as exploration, backtracking, and dynamic
re-planning within a single rollout. To demonstrate the promise of this scaling
dimension, we study the domain of web agents. We first show that even
prompting-based interaction scaling without any training can improve task
success on web benchmarks non-trivially. Building on this, we introduce TTI
(Test-Time Interaction), a curriculum-based online reinforcement learning (RL)
approach that trains agents by adaptively adjusting their rollout lengths.
Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data
web agents on WebVoyager and WebArena benchmarks. We further show that TTI
enables agents to balance exploration and exploitation adaptively. Our results
establish interaction scaling as a powerful, complementary axis to scaling
per-step compute, offering new avenues for training adaptive agents.
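To make the curriculum idea concrete, here is a minimal, self-contained sketch of adaptively lengthening an agent's interaction horizon during training. Everything here is hypothetical illustration, not the paper's TTI implementation: the toy environment (`run_episode`, `GOAL`), the policy, and the doubling schedule with a success-rate `threshold` are all invented for the example.

```python
import random

random.seed(0)

GOAL = 2  # toy task: accumulate GOAL forward steps before the horizon runs out


def policy(state):
    # Stochastic toy policy: step forward with 70% probability.
    return 1 if random.random() < 0.7 else 0


def run_episode(horizon):
    """One rollout capped at `horizon` interaction steps; success = reaching GOAL."""
    state = 0
    for _ in range(horizon):
        state += policy(state)
        if state >= GOAL:
            return 1.0
    return 0.0


def train_with_horizon_curriculum(start_h=4, max_h=32, batch=50, threshold=0.6):
    """Hypothetical curriculum: lengthen the rollout horizon as the agent improves.

    Evaluate success over a batch of rollouts at the current horizon and
    double the budget once the success rate clears `threshold`; in a real
    RL loop, a policy-update step would run between evaluations.
    """
    horizon = start_h
    schedule = []  # (horizon, success_rate) pairs, in curriculum order
    while horizon <= max_h:
        success = sum(run_episode(horizon) for _ in range(batch)) / batch
        schedule.append((horizon, success))
        if success < threshold:
            break  # agent not ready for a longer horizon yet
        horizon *= 2
    return schedule
```

A longer horizon gives each rollout room for exploration and backtracking, while gating the increase on success keeps early training cheap; this mirrors the abstract's point that interaction scaling is an axis separate from per-step "thinking" compute.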