Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

November 6, 2024
Authors: Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI

Abstract

Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at https://github.com/SalesforceAIResearch/LaTRO.
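
The following is a minimal, self-contained sketch of the core idea, not the authors' implementation (see the repository above for that): rationales are treated as latent samples z, the "reward" is the model's own log-likelihood of the gold answer given the rationale, and the rationale sampler is updated with a REINFORCE-style gradient. The toy "LLM" here is a fixed table of answer log-likelihoods over a small discrete rationale space, and the leave-one-out baseline is one common variance-reduction choice; both are stand-in assumptions for illustration.

```python
# Toy sketch of self-rewarding latent-rationale optimization (LaTRO-style).
# Assumptions (not from the paper's code): a discrete rationale space,
# a fixed table standing in for the LLM's answer log-likelihoods, and a
# leave-one-out baseline for variance reduction.

import torch

torch.manual_seed(0)

NUM_RATIONALES = 8   # toy latent space of candidate rationales z
K = 4                # rationales sampled per step (Monte Carlo size)

# q_theta(z | x): trainable sampler over rationales (the policy being optimized)
sampler_logits = torch.zeros(NUM_RATIONALES, requires_grad=True)

# log p(y | x, z): self-evaluation scores. In LaTRO the same LLM scores the
# gold answer conditioned on each rationale; here a fixed vector in which
# rationale 5 best supports the correct answer (a hypothetical stand-in).
answer_logprob = torch.full((NUM_RATIONALES,), -4.0)
answer_logprob[5] = -0.5

opt = torch.optim.Adam([sampler_logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=sampler_logits)
    z = dist.sample((K,))                         # sample K rationales
    reward = answer_logprob[z]                    # self-reward: log p(y | x, z)
    baseline = (reward.sum() - reward) / (K - 1)  # leave-one-out baseline
    # REINFORCE estimate of the gradient of E_q[log p(y | x, z)]
    loss = -((reward - baseline).detach() * dist.log_prob(z)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probability mass should concentrate on rationale 5.
print(torch.softmax(sampler_logits, dim=0))
```

With the self-reward as the only learning signal, the sampler's probability mass concentrates on the rationale that best supports the correct answer, mirroring how LaTRO lets a model improve its own reasoning distribution without an external reward model.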
