Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

Abstract

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only the final answers discards rich information from the trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats the parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead: the aggregation cost remains bounded by that of a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.
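To make "trajectories as an environment" concrete, the following is a minimal sketch of what such lightweight tools might look like. It is not the AggAgent implementation: the class names (Trajectory, TrajectoryEnv), the tool names (list_candidates, search, read_step), and the substring-based search are all assumptions introduced for illustration, since the abstract does not specify the tool interface.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One parallel rollout: its recorded steps and its final answer (hypothetical schema)."""
    steps: list[str]
    final_answer: str


class TrajectoryEnv:
    """Hypothetical environment that exposes N rollouts to an aggregation agent."""

    def __init__(self, trajectories: list[Trajectory]):
        self.trajectories = trajectories

    def list_candidates(self) -> list[tuple[int, str]]:
        """Inspect tool: return (rollout index, final answer) pairs."""
        return [(i, t.final_answer) for i, t in enumerate(self.trajectories)]

    def search(self, query: str, max_hits: int = 5) -> list[tuple[int, int, str]]:
        """Search tool: case-insensitive substring match over every step.

        Returns at most max_hits (rollout index, step index, step text) triples,
        so only matching snippets enter the aggregator's context.
        """
        needle = query.lower()
        hits = []
        for i, traj in enumerate(self.trajectories):
            for j, step in enumerate(traj.steps):
                if needle in step.lower():
                    hits.append((i, j, step))
                    if len(hits) == max_hits:
                        return hits
        return hits

    def read_step(self, rollout: int, step: int) -> str:
        """Read tool: fetch a single step of a single rollout on demand."""
        return self.trajectories[rollout].steps[step]


# Toy usage with two made-up rollouts that disagree on the final answer.
env = TrajectoryEnv([
    Trajectory(["search: capital of Australia",
                "page says Canberra is the capital"], "Canberra"),
    Trajectory(["search: largest city in Australia",
                "page says Sydney is the largest city"], "Sydney"),
])
candidates = env.list_candidates()
hits = env.search("capital")
```

In this sketch the aggregator never receives the concatenated trajectories. It first looks at the short list of candidate answers, then pulls in only the steps that match a query, which is one plausible way to keep aggregation within a single context window.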