Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Abstract

Large language models (LLMs) have shown impressive capabilities, but they still struggle with complex reasoning tasks that require multiple steps. While prompt-based methods such as Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and their ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on the GSM8K and ARC-Challenge datasets using multiple model architectures, including Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over the base models and by 9.6% over supervised fine-tuning across these three models. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through the proposed optimization approach in a self-improving manner. The code for LaTRO is available at https://github.com/SalesforceAIResearch/LaTRO.
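
For readers who prefer symbols, the following is a generic sketch of a variational lower bound that matches the description of reasoning as sampling from a latent distribution. The notation is an assumption introduced here for illustration and does not come from the abstract: $x$ is the query, $z$ the latent reasoning rationale, $y$ the answer, $p_\theta$ the language model, and $q$ the distribution used to sample rationales. The objective actually used in the paper may differ in its details.

```latex
\begin{align*}
\log p_\theta(y \mid x)
  &= \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}
     \bigl[ p_\theta(y \mid x, z) \bigr] \\
  &\ge \mathbb{E}_{z \sim q(\cdot \mid x)}
     \bigl[ \log p_\theta(y \mid x, z) \bigr]
     - D_{\mathrm{KL}}\bigl( q(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x) \bigr)
\end{align*}
```

Under this reading, the term $\log p_\theta(y \mid x, z)$ is computed by the model itself and can act as a reward for a sampled rationale $z$, which would explain why no external feedback or separate reward model is needed. The KL term keeps the rationale sampler close to the model's own distribution over rationales.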
