Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Abstract

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that the optimal architectures and model combinations can vary across tasks. We therefore study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments and edges capture information flow. The problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these challenges, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient that updates the probabilistic graph and makes the search for optimal multi-LLM collaboration graphs efficient. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.
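
For orientation, the classical sampling-gradient-update loop that Agent-REINFORCE is described as mirroring can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation. It assumes a graph parameterized by independent per-edge Bernoulli probabilities and a toy reward that stands in for task accuracy, and it ignores node roles, model assignments, and the compute budget. The names `sample_graph`, `toy_reward`, and `reinforce_search` are hypothetical. In Agent-REINFORCE, per the abstract, the numeric gradient step would be replaced by LLM-agent feedback acting as a textual gradient.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sample_graph(logits, rng):
    """Sample one graph: edge i is present with probability sigmoid(logits[i])."""
    return [1 if rng.random() < sigmoid(t) else 0 for t in logits]


def toy_reward(edges, target):
    """Hypothetical stand-in for task accuracy: fraction of edges matching a target graph."""
    return sum(e == t for e, t in zip(edges, target)) / len(target)


def reinforce_search(target, steps=500, batch=16, lr=0.5, seed=0):
    """Plain REINFORCE over edge logits (assumed parameterization, not the paper's)."""
    rng = random.Random(seed)
    logits = [0.0] * len(target)  # start with every edge at probability 0.5
    for _ in range(steps):
        # Sampling: draw a batch of candidate graphs from the current distribution.
        samples = [sample_graph(logits, rng) for _ in range(batch)]
        rewards = [toy_reward(s, target) for s in samples]
        baseline = sum(rewards) / batch  # batch-mean baseline to reduce variance
        probs = [sigmoid(t) for t in logits]
        # Gradient and update: for a Bernoulli edge with p = sigmoid(logit),
        # the derivative of log-probability w.r.t. the logit is (edge - p).
        for i in range(len(logits)):
            grad = sum(
                (r - baseline) * (s[i] - probs[i])
                for s, r in zip(samples, rewards)
            ) / batch
            logits[i] += lr * grad
    return [sigmoid(t) for t in logits]


if __name__ == "__main__":
    target_graph = [1, 0, 1, 1, 0, 0]  # hypothetical "best" edge set
    print([round(p, 2) for p in reinforce_search(target_graph)])
```

Under these assumptions, the learned edge probabilities should drift toward the edges that the toy reward favors. The sample efficiency of this numeric loop is the quantity the abstract says the feedback-based variant improves.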
