Scaling Test-time Compute for LLM Agents

Abstract

Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which they improve agent effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies when applying test-time scaling to language agents, and report the following findings: 1. Scaling test-time compute can improve the performance of agents. 2. Knowing when to reflect is important for agents. 3. Among different verification and result-merging approaches, the list-wise method performs best. 4. Increasing the diversity of rollouts has a positive effect on the agent's task performance.
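As a rough illustration of how parallel sampling and list-wise selection fit together, the sketch below draws several candidate rollouts and hands the whole list to a selector that returns the index of its preferred candidate. This is a minimal sketch under stated assumptions, not the implementation used in this work: `best_of_n`, `toy_rollout`, and `majority_select` are hypothetical names, the rollout is a random stub rather than an agent, and the majority-vote selector is a toy stand-in for a list-wise verifier.

```python
import random
from typing import Callable, Dict, List, Sequence


def best_of_n(
    rollout: Callable[[random.Random], str],
    listwise_select: Callable[[Sequence[str]], int],
    n: int,
    seed: int = 0,
) -> str:
    """Sample n candidate rollouts, then let a list-wise selector pick one.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Parallel sampling: each rollout is drawn independently of the others.
    candidates: List[str] = [rollout(rng) for _ in range(n)]
    # List-wise selection: the selector sees every candidate at once
    # and returns the index of the one it prefers.
    return candidates[listwise_select(candidates)]


def toy_rollout(rng: random.Random) -> str:
    """Stub standing in for one agent rollout (assumption, not a real agent)."""
    return str(rng.choice([41, 42, 42, 43]))


def majority_select(candidates: Sequence[str]) -> int:
    """Toy list-wise selector: index of the most frequent candidate."""
    counts: Dict[str, int] = {}
    for candidate in candidates:
        counts[candidate] = counts.get(candidate, 0) + 1
    most_common = max(counts, key=counts.get)
    return candidates.index(most_common)


if __name__ == "__main__":
    print(best_of_n(toy_rollout, majority_select, n=8, seed=1))
```

In a real agent setting, the selector would typically be a verifier model that compares the candidates jointly, and `n` would be the knob that trades extra test-time compute for answer quality.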
