AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks
July 26, 2025
Authors: Fali Wang, Hui Liu, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Zongyu Wu, Chen Luo, Zhen Li, Xianfeng Tang, Qi He, Suhang Wang
cs.AI
Abstract
Test-time scaling (TTS) enhances the performance of large language models
(LLMs) by allocating additional compute resources during inference. However,
existing research primarily investigates TTS in single-stage tasks, whereas
many real-world problems are multi-stage complex tasks composed of a sequence
of heterogeneous subtasks, each requiring an LLM with specific capabilities.
We therefore study a novel problem: test-time compute-optimal scaling in
multi-stage complex tasks, which aims to select suitable models and allocate
budgets per subtask to maximize overall performance. TTS in multi-stage tasks
introduces two fundamental challenges: (i) The combinatorial search space of
model and budget allocations, combined with the high cost of inference, makes
brute-force search impractical. (ii) The optimal model and budget allocations
across subtasks are interdependent, increasing the complexity of the
compute-optimal search. To address this gap, we conduct extensive pilot
experiments on four tasks across six datasets, deriving three empirical
insights characterizing the behavior of LLMs in multi-stage complex tasks.
Informed by these insights, we propose AgentTTS, an LLM-agent-based framework
that autonomously searches for compute-optimal allocations through iterative
feedback-driven interactions with the execution environment. Experimental
results demonstrate that AgentTTS significantly outperforms traditional and
other LLM-based baselines in search efficiency, and exhibits improved
robustness to varying training set sizes as well as enhanced interpretability.
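
To make the allocation problem concrete, below is a minimal Python sketch, not the authors' implementation: the subtask names, model pool, budget grid, and the `evaluate` and `propose_next` stubs are hypothetical placeholders assumed for illustration. The first function enumerates the combinatorial space of per-subtask (model, budget) assignments that makes brute-force search impractical; the second is a skeleton of a feedback-driven search loop in the spirit of AgentTTS, where an agent proposes the next allocation based on prior trial outcomes.

```python
# Illustrative sketch only -- all names below (SUBTASKS, MODELS, BUDGETS,
# evaluate, propose_next) are hypothetical placeholders, not AgentTTS APIs.

from itertools import product

# A multi-stage task: each subtask needs a model choice and a compute budget.
SUBTASKS = ["retrieve", "reason", "verify"]          # hypothetical subtasks
MODELS   = ["small-llm", "medium-llm", "large-llm"]  # hypothetical model pool
BUDGETS  = [1, 4, 16]                                # e.g., samples per subtask


def evaluate(allocation):
    """Stub for running the full pipeline under a given allocation and
    measuring end-to-end performance. In practice this is an expensive
    inference run, which is why brute-force search is impractical."""
    raise NotImplementedError


def brute_force_search():
    """Enumerate every (model, budget) assignment per subtask.
    The space grows as (|MODELS| * |BUDGETS|) ** |SUBTASKS|: here
    9 ** 3 = 729 full pipeline evaluations, and far more for realistic
    model pools, budget grids, and longer pipelines."""
    best_alloc, best_score = None, float("-inf")
    for choice in product(product(MODELS, BUDGETS), repeat=len(SUBTASKS)):
        allocation = dict(zip(SUBTASKS, choice))
        score = evaluate(allocation)
        if score > best_score:
            best_alloc, best_score = allocation, score
    return best_alloc, best_score


def agent_guided_search(propose_next, n_trials=20):
    """Skeleton of a feedback-driven search in the spirit of the paper's
    agent-based approach: `propose_next` (e.g., an LLM prompted with prior
    trials and empirical insights) suggests the next allocation to try,
    instead of enumerating all of them."""
    history = []
    best_alloc, best_score = None, float("-inf")
    for _ in range(n_trials):
        allocation = propose_next(history)    # agent picks the next candidate
        score = evaluate(allocation)          # expensive environment feedback
        history.append((allocation, score))   # feedback informs next proposal
        if score > best_score:
            best_alloc, best_score = allocation, score
    return best_alloc, best_score
```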