

Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

October 29, 2025
Authors: Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang
cs.AI

Abstract

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.
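
To make the sampling-feedback-update loop concrete, below is a minimal Python sketch based only on the abstract: a probabilistic collaboration graph (categorical distributions over per-role model assignments, Bernoulli probabilities over edges) is repeatedly sampled, evaluated, and updated REINFORCE-style. All identifiers here (`ROLES`, `MODELS`, `evaluate`, `agent_feedback_update`) are illustrative assumptions, not the paper's actual API; in Agent-REINFORCE the update is driven by an LLM agent's textual feedback rather than the numeric nudge used in this stub.

```python
"""Minimal, illustrative sketch of an Agent-REINFORCE-style search loop,
reconstructed from the abstract alone. Roles, models, and helper
functions are hypothetical; the paper's actual formulation may differ."""

import random

# Hypothetical search space: each node is a (role, model) assignment.
ROLES = ["generator", "verifier", "aggregator"]
MODELS = ["llm-small", "llm-medium", "llm-large"]

# Probabilistic graph: a categorical distribution over the model assigned
# to each role, plus an independent Bernoulli probability for each edge.
node_probs = {role: {m: 1.0 / len(MODELS) for m in MODELS} for role in ROLES}
edge_probs = {(a, b): 0.5 for a in ROLES for b in ROLES if a != b}


def sample_graph():
    """Sample a concrete collaboration graph from the probabilistic graph."""
    nodes = {role: random.choices(list(p), weights=p.values())[0]
             for role, p in node_probs.items()}
    edges = [e for e, p in edge_probs.items() if random.random() < p]
    return nodes, edges


def evaluate(graph):
    """Placeholder: run the sampled multi-LLM pipeline on a validation set
    and return (accuracy, latency). Stubbed with random numbers here."""
    return random.random(), random.uniform(0.5, 2.0)


def agent_feedback_update(graph, accuracy, latency):
    """Stand-in for the LLM agent's 'textual gradient': shift probability
    mass toward the sampled choices when the graph scores well, and away
    when it scores poorly (a REINFORCE-like update)."""
    reward = accuracy - 0.1 * latency  # joint accuracy/latency objective
    lr = 0.1
    nodes, edges = graph
    for role, model in nodes.items():
        for m in node_probs[role]:
            direction = 1.0 if m == model else -1.0 / (len(MODELS) - 1)
            node_probs[role][m] = min(1.0, max(
                1e-3, node_probs[role][m] + lr * reward * direction))
        total = sum(node_probs[role].values())  # renormalize to a distribution
        node_probs[role] = {m: v / total for m, v in node_probs[role].items()}
    for e in edge_probs:
        sign = 1.0 if e in edges else -1.0
        edge_probs[e] = min(0.99, max(0.01, edge_probs[e] + lr * reward * sign))


# Sampling-feedback-update loop mirroring REINFORCE's sampling-gradient-update.
for step in range(20):
    g = sample_graph()
    acc, lat = evaluate(g)
    agent_feedback_update(g, acc, lat)
```

Searching over distributions rather than enumerating graphs directly is what keeps the combinatorial space tractable: each update reshapes the whole distribution, so promising regions are reinforced without exhaustively scoring every candidate topology and model combination.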