Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
October 29, 2025
Authors: Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang
cs.AI
Abstract
Test-Time Scaling (TTS) improves large language models (LLMs) by allocating
additional computation during inference, typically through parallel,
sequential, or hybrid scaling. However, prior studies often assume fixed
collaboration architectures (e.g., topologies) and single-model usage,
overlooking that optimal architectures and model combinations can vary across
tasks. Therefore, we study the novel problem of searching for compute-optimal
model combinations and architectures in TTS under a fixed budget. We formalize
it as a multi-LLM collaboration graph, where nodes encode roles and LLM model
assignments, and edges capture information flow. This problem is challenging
because (i) the combinatorial search space is prohibitively large, and (ii)
task-specific requirements demand tailored designs. To address these challenges, we
reformulate the problem as probabilistic graph optimization and, through pilot
experiments, derive three empirical insights into TTS collaboration graphs.
Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented
framework that mirrors the REINFORCE pipeline by mapping
sampling-gradient-update to sampling-feedback-update, where feedback serves as
a textual gradient to update the probabilistic graph and efficiently search for
optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE
outperforms both traditional and LLM-based baselines in sample efficiency and
search performance, and effectively identifies optimal graphs under joint
objectives of accuracy and inference latency.
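The sampling-feedback-update loop described above can be sketched in miniature. The class, role names, reward signal, and update rule below are illustrative assumptions for exposition, not the paper's implementation; in Agent-REINFORCE the numeric reward stands in for textual feedback produced by LLM agents.

```python
import random

class ProbabilisticCollabGraph:
    """Hypothetical probabilistic collaboration graph: each node (role) holds a
    distribution over LLM assignments, and each candidate edge (information-flow
    link) holds an inclusion probability."""

    def __init__(self, roles, models, candidate_edges):
        # P(model | role): start uniform over the model pool.
        self.node_probs = {r: {m: 1.0 / len(models) for m in models} for r in roles}
        # P(edge present): start at 0.5 for each candidate edge.
        self.edge_probs = {e: 0.5 for e in candidate_edges}

    def sample(self):
        """Draw one concrete multi-LLM collaboration graph."""
        assignment = {r: random.choices(list(p), weights=list(p.values()))[0]
                      for r, p in self.node_probs.items()}
        edges = [e for e, p in self.edge_probs.items() if random.random() < p]
        return assignment, edges

    def update(self, sampled_edges, reward, lr=0.1):
        """REINFORCE-like update: nudge edge probabilities toward the sampled
        structure when reward is high, away when it is low (clipped to keep
        exploration alive)."""
        for e in self.edge_probs:
            direction = 1.0 if e in sampled_edges else -1.0
            self.edge_probs[e] = min(0.95, max(0.05,
                self.edge_probs[e] + lr * reward * direction))

roles = ["planner", "solver", "verifier"]       # assumed role set
models = ["llm_small", "llm_large"]             # assumed model pool
candidates = [("planner", "solver"), ("solver", "verifier"),
              ("planner", "verifier")]

graph = ProbabilisticCollabGraph(roles, models, candidates)
for step in range(20):
    assignment, edges = graph.sample()
    # Stand-in for evaluating the sampled graph under the compute budget;
    # here the toy reward simply favors including the solver->verifier edge.
    reward = 1.0 if ("solver", "verifier") in edges else -1.0
    graph.update(edges, reward)

print(round(graph.edge_probs[("solver", "verifier")], 2))  # → 0.95
```

The loop concentrates probability mass on graph structures that earn high feedback, which is the sense in which textual feedback acts as a gradient substitute in the paper's framework.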