Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Abstract

Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work has proposed advanced prompting techniques and fine-tuning on high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet the efficacy of LLMs in refining their own responses, particularly in complex reasoning and planning tasks, remains doubtful. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vast search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results on mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.


