Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
June 9, 2025
Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
cs.AI
Abstract
The current paradigm of test-time scaling relies on generating long reasoning
traces ("thinking" more) before producing a response. In agent problems that
require interaction, this can be done by generating thinking traces before
acting in the world. However, this process does not allow agents to acquire new
information from the environment or adapt their behavior over time. In this
work, we propose to scale test-time interaction, an untapped dimension of
test-time scaling that increases the agent's interaction horizon to enable
running rich behaviors such as exploration, backtracking, and dynamic
re-planning within a single rollout. To demonstrate the promise of this scaling
dimension, we study the domain of web agents. We first show that even
prompting-based interaction scaling without any training can improve task
success on web benchmarks non-trivially. Building on this, we introduce TTI
(Test-Time Interaction), a curriculum-based online reinforcement learning (RL)
approach that trains agents by adaptively adjusting their rollout lengths.
Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data
web agents on WebVoyager and WebArena benchmarks. We further show that TTI
enables agents to balance exploration and exploitation adaptively. Our results
establish interaction scaling as a powerful, complementary axis to scaling
per-step compute, offering new avenues for training adaptive agents.
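To make the curriculum idea concrete, here is a minimal, self-contained sketch of adaptively lengthening an agent's interaction horizon during training. Everything here is hypothetical illustration, not the paper's TTI implementation: the toy environment (`run_episode`, `GOAL`), the policy, and the doubling schedule with a success-rate `threshold` are all invented for the example.

```python
import random

random.seed(0)

GOAL = 2  # toy task: accumulate GOAL forward steps before the horizon runs out


def policy(state):
    # Stochastic toy policy: step forward with 70% probability.
    return 1 if random.random() < 0.7 else 0


def run_episode(horizon):
    """One rollout capped at `horizon` interaction steps; success = reaching GOAL."""
    state = 0
    for _ in range(horizon):
        state += policy(state)
        if state >= GOAL:
            return 1.0
    return 0.0


def train_with_horizon_curriculum(start_h=4, max_h=32, batch=50, threshold=0.6):
    """Hypothetical curriculum: lengthen the rollout horizon as the agent improves.

    Evaluate success over a batch of rollouts at the current horizon and
    double the budget once the success rate clears `threshold`; in a real
    RL loop, a policy-update step would run between evaluations.
    """
    horizon = start_h
    schedule = []  # (horizon, success_rate) pairs, in curriculum order
    while horizon <= max_h:
        success = sum(run_episode(horizon) for _ in range(batch)) / batch
        schedule.append((horizon, success))
        if success < threshold:
            break  # agent not ready for a longer horizon yet
        horizon *= 2
    return schedule
```

A longer horizon gives each rollout room for exploration and backtracking, while gating the increase on success keeps early training cheap; this mirrors the abstract's point that interaction scaling is an axis separate from per-step "thinking" compute.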