AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks
July 26, 2025
Authors: Fali Wang, Hui Liu, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Zongyu Wu, Chen Luo, Zhen Li, Xianfeng Tang, Qi He, Suhang Wang
cs.AI
Abstract
Test-time scaling (TTS) enhances the performance of large language models
(LLMs) by allocating additional compute resources during inference. However,
existing research primarily investigates TTS in single-stage tasks, whereas
many real-world problems are multi-stage complex tasks composed of a sequence
of heterogeneous subtasks, each requiring an LLM with specific capabilities.
We therefore study a novel problem: test-time compute-optimal scaling in
multi-stage complex tasks, which aims to select suitable models and allocate
budgets per subtask to maximize overall performance. TTS in multi-stage tasks
introduces two fundamental challenges: (i) The combinatorial search space of
model and budget allocations, combined with the high cost of inference, makes
brute-force search impractical. (ii) The optimal model and budget allocations
across subtasks are interdependent, increasing the complexity of the
compute-optimal search. To address this gap, we conduct extensive pilot
experiments on four tasks across six datasets, deriving three empirical
insights characterizing the behavior of LLMs in multi-stage complex tasks.
Informed by these insights, we propose AgentTTS, an LLM-agent-based framework
that autonomously searches for compute-optimal allocations through iterative
feedback-driven interactions with the execution environment. Experimental
results demonstrate that AgentTTS significantly outperforms traditional and
other LLM-based baselines in search efficiency, and exhibits improved
robustness to varying training set sizes as well as enhanced interpretability.
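
To make the allocation problem concrete, below is a minimal Python sketch, not the authors' implementation: the subtask names, model pool, budget grid, and the `evaluate` and `propose_next` stubs are hypothetical placeholders assumed for illustration. The first function enumerates the combinatorial space of per-subtask (model, budget) assignments that makes brute-force search impractical; the second is a skeleton of a feedback-driven search loop in the spirit of AgentTTS, where an agent proposes the next allocation based on prior trial outcomes.

```python
# Illustrative sketch only -- all names below (SUBTASKS, MODELS, BUDGETS,
# evaluate, propose_next) are hypothetical placeholders, not AgentTTS APIs.

from itertools import product

# A multi-stage task: each subtask needs a model choice and a compute budget.
SUBTASKS = ["retrieve", "reason", "verify"]          # hypothetical subtasks
MODELS   = ["small-llm", "medium-llm", "large-llm"]  # hypothetical model pool
BUDGETS  = [1, 4, 16]                                # e.g., samples per subtask


def evaluate(allocation):
    """Stub for running the full pipeline under a given allocation and
    measuring end-to-end performance. In practice this is an expensive
    inference run, which is why brute-force search is impractical."""
    raise NotImplementedError


def brute_force_search():
    """Enumerate every (model, budget) assignment per subtask.
    The space grows as (|MODELS| * |BUDGETS|) ** |SUBTASKS|: here
    9 ** 3 = 729 full pipeline evaluations, and far more for realistic
    model pools, budget grids, and longer pipelines."""
    best_alloc, best_score = None, float("-inf")
    for choice in product(product(MODELS, BUDGETS), repeat=len(SUBTASKS)):
        allocation = dict(zip(SUBTASKS, choice))
        score = evaluate(allocation)
        if score > best_score:
            best_alloc, best_score = allocation, score
    return best_alloc, best_score


def agent_guided_search(propose_next, n_trials=20):
    """Skeleton of a feedback-driven search in the spirit of the paper's
    agent-based approach: `propose_next` (e.g., an LLM prompted with prior
    trials and empirical insights) suggests the next allocation to try,
    instead of enumerating all of them."""
    history = []
    best_alloc, best_score = None, float("-inf")
    for _ in range(n_trials):
        allocation = propose_next(history)    # agent picks the next candidate
        score = evaluate(allocation)          # expensive environment feedback
        history.append((allocation, score))   # feedback informs next proposal
        if score > best_score:
            best_alloc, best_score = allocation, score
    return best_alloc, best_score
```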