Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Abstract

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. We therefore study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these challenges, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.
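To make the sampling-gradient-update loop concrete, the following is a minimal sketch of plain REINFORCE over a probabilistic graph, in which each candidate edge is an independent Bernoulli variable. It is not Agent-REINFORCE itself, which replaces the numeric gradient with textual feedback from an LLM agent. The node names, the toy reward, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical roles for a 3-node collaboration graph (assumption, not from the paper).
NODES = ["draft_a", "draft_b", "aggregate"]
# Candidate directed edges (information flow); forward-only edges keep the graph acyclic.
EDGES = [(0, 1), (0, 2), (1, 2)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sample_graph(logits, rng):
    """Sample a binary mask over EDGES; edge i is kept with probability sigmoid(logits[i])."""
    return [1 if rng.random() < sigmoid(l) else 0 for l in logits]


def toy_reward(mask):
    """Stand-in for task utility (assumption): favour both drafts feeding the
    aggregator and penalise the sequential edge draft_a -> draft_b."""
    return mask[1] + mask[2] - mask[0]


def reinforce(steps=300, batch=16, lr=0.5, seed=0):
    """Run REINFORCE on the edge logits and return the learned edge probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(EDGES)  # every edge starts at probability 0.5
    for _ in range(steps):
        # 1) Sampling: draw a batch of graphs from the current distribution.
        samples = [sample_graph(logits, rng) for _ in range(batch)]
        rewards = [toy_reward(m) for m in samples]
        baseline = sum(rewards) / batch  # mean-reward baseline to reduce variance
        probs = [sigmoid(l) for l in logits]
        for i in range(len(EDGES)):
            # 2) Gradient: d/d(logit) of log Bernoulli(mask_i; p_i) is (mask_i - p_i).
            grad = sum(
                (r - baseline) * (m[i] - probs[i]) for m, r in zip(samples, rewards)
            ) / batch
            # 3) Update: gradient ascent on the expected reward.
            logits[i] += lr * grad
    return [sigmoid(l) for l in logits]


if __name__ == "__main__":
    for (src, dst), p in zip(EDGES, reinforce()):
        print(f"{NODES[src]} -> {NODES[dst]}: p = {p:.2f}")
```

In this toy setting the two edges into the aggregator should drift toward probability 1 and the sequential edge toward 0. A real search would replace `toy_reward` with measured task accuracy (and possibly latency) under the compute budget, and would also need to sample model assignments per node, which this sketch omits.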
