AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks

Abstract

Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, whereas many real-world problems are multi-stage complex tasks composed of a sequence of heterogeneous subtasks, each of which requires an LLM with specific capabilities. We therefore study a novel problem: test-time compute-optimal scaling in multi-stage complex tasks, which aims to select suitable models and allocate budgets per subtask so as to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) the combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical; (ii) the optimal model and budget allocations across subtasks are interdependent, which increases the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets and derive three empirical insights that characterize the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative, feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and that it shows improved robustness to varying training set sizes and enhanced interpretability.
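To make challenge (i) concrete, the following is a minimal sketch of the brute-force search that the abstract calls impractical. It is not the AgentTTS method. The model names, per-sample costs, and sample counts are hypothetical placeholders chosen for illustration, and the budget is modeled simply as the sum of per-subtask costs, which is an assumption rather than the paper's definition.

```python
from itertools import product

# Hypothetical candidates: (model name, relative cost per sample).
# These values are illustrative assumptions, not taken from the paper.
MODELS = [("small", 1), ("medium", 4), ("large", 16)]
SAMPLE_COUNTS = [1, 2, 4, 8, 16]


def subtask_options():
    """Return every (model, samples, cost) choice for a single subtask."""
    return [(name, n, cost * n) for (name, cost) in MODELS for n in SAMPLE_COUNTS]


def search_space_size(num_subtasks):
    """Count all per-subtask assignments before any budget filtering."""
    return len(subtask_options()) ** num_subtasks


def feasible_allocations(num_subtasks, total_budget):
    """Yield every assignment whose summed cost fits within the budget."""
    options = subtask_options()
    for combo in product(options, repeat=num_subtasks):
        if sum(cost for _, _, cost in combo) <= total_budget:
            yield combo
```

With 3 models and 5 sample counts there are 15 options per subtask, so a three-stage task already has 15**3 = 3375 candidate allocations, and each one would need a full inference run to score. The count grows exponentially with the number of subtasks, which is why exhaustive enumeration stops being viable.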
