ChatPaper.ai

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

April 18, 2024
Authors: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, Dong Yu
cs.AI

Abstract

Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work has proposed advanced prompting techniques and fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet the efficacy of LLMs in self-refining their responses, particularly on complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vast search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models that provide precise feedback. Our experimental results on mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.
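The paper's implementation is not reproduced here, but the loop the abstract describes — a policy model proposing candidate continuations, MCTS organizing the search over them, and critic models scoring partial outputs — can be sketched in miniature. In the toy Python sketch below, `propose_steps` and `critic` are invented stand-ins for the LLM policy and the critic models, and the arithmetic "task" (build a sequence of steps summing to a target) is purely illustrative; the real system searches over reasoning steps in natural language.

```python
import math

class Node:
    """A node in the search tree over partial responses."""
    def __init__(self, state, parent=None):
        self.state = state        # partial response: a list of steps so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        """Upper-confidence bound used during selection."""
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def propose_steps(state):
    # Stand-in for the LLM policy: propose candidate next steps.
    return [state + [step] for step in ("+1", "+2")]

def critic(state):
    # Stand-in for a learned critic: reward partial answers whose
    # running total is close to a target of 5.
    total = sum(int(s) for s in state)
    return max(0.0, 1.0 - abs(5 - total) / 5)

def mcts(root_state, iterations=200, max_depth=5):
    root = Node(root_state)
    best_state, best_value = root_state, critic(root_state)
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: grow the leaf with policy-proposed continuations,
        # scoring each candidate with the critic.
        if len(node.state) < max_depth:
            node.children = [Node(s, node) for s in propose_steps(node.state)]
            for child in node.children:
                v = critic(child.state)
                if v > best_value:
                    best_state, best_value = child.state, v
            node = max(node.children, key=lambda n: critic(n.state))
        # Evaluation and backpropagation.
        value = critic(node.state)
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    return best_state, best_value

state, value = mcts([])
```

In the full method, the trajectories found this way (and their critic scores) would feed back into further training of the policy, closing the self-improvement loop; this sketch covers only the search-and-critique step.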
