

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

May 21, 2025
Authors: Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
cs.AI

Abstract

Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
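
To make the entropy objective concrete, below is a minimal PyTorch sketch of (a) the token-level entropy loss that EM-FT-style finetuning would minimize on the model's own unlabeled samples, and (b) an inference-time logit-sharpening loop in the spirit of EM-INF. Function names, hyperparameters, and the gradient-based loop are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; names and hyperparameters are assumptions,
# not the paper's released code.
import torch
import torch.nn.functional as F

def token_entropy_loss(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the next-token distribution.

    logits: (batch, seq_len, vocab_size) raw model scores.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    Minimizing this on the model's own sampled outputs is the EM-FT-style objective;
    its negative can likewise serve as the sole reward in an EM-RL-style setup.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    mask = attention_mask.float()
    return (entropy * mask).sum() / mask.sum()

def sharpen_logits(logits: torch.Tensor, steps: int = 10, lr: float = 0.1) -> torch.Tensor:
    """Inference-time entropy reduction in the spirit of EM-INF (hypothetical variant):
    gradient-descend on the entropy of softmax(logits), touching no model parameters."""
    z = logits.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        log_p = F.log_softmax(z, dim=-1)
        entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return z.detach()  # sharpened logits for sampling the next token
```

Both pieces rely only on the model's own confidence: the loss drives training-time updates without labeled data, while the sharpening loop adjusts next-token logits at decoding time with no parameter updates at all.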
