
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

April 18, 2024
作者: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, Dong Yu
cs.AI

Abstract

Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work has proposed advanced prompting techniques and argued for the necessity of fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet, the efficacy of LLMs in self-refining their responses, particularly on complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vast search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results on mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.
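To make the search component concrete, below is a minimal, generic MCTS sketch over text "states". It is not the paper's AlphaLLM implementation: the functions `propose_steps` (standing in for the LLM policy that proposes candidate reasoning steps) and `critic_value` (standing in for the paper's critic models) are hypothetical toy stubs, and the class and function names are illustrative only.

```python
import math
import random

random.seed(0)

def propose_steps(state):
    """Toy stand-in for an LLM proposing candidate next reasoning steps."""
    return [state + c for c in "ab"] if len(state) < 4 else []

def critic_value(state):
    """Toy stand-in for a learned critic scoring a (partial) solution."""
    return state.count("a") / max(len(state), 1)

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        # Unvisited children are explored first; otherwise balance the
        # mean value (exploitation) against an exploration bonus.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_search(root_state, iterations=200):
    root = Node(root_state)
    for _ in range(iterations):
        # Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # Expansion: ask the (stub) policy for candidate next steps.
        steps = propose_steps(node.state)
        if steps and node.visits > 0:
            node.children = [Node(s, node) for s in steps]
            node = random.choice(node.children)
        # Evaluation: score the leaf with the (stub) critic.
        reward = critic_value(node.state)
        # Backpropagation: update statistics along the path to the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first step, as is standard in MCTS.
    best = max(root.children, key=lambda n: n.visits)
    return best.state
```

In this toy setup the critic rewards states containing more `"a"` characters, so `mcts_search("")` converges on the `"a"` branch; in AlphaLLM the proposal and evaluation roles would instead be played by the LLM and its critic models.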
