The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
May 21, 2025
Authors: Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
cs.AI
Abstract
Entropy minimization (EM) trains the model to concentrate even more
probability mass on its most confident outputs. We show that this simple
objective alone, without any labeled data, can substantially improve large
language models' (LLMs) performance on challenging math, physics, and coding
tasks. We explore three approaches: (1) EM-FT: token-level entropy minimization, analogous to instruction finetuning but applied to unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce
entropy without any training data or parameter updates. On Qwen-7B, EM-RL,
without any labeled data, achieves comparable or better performance than strong
RL baselines such as GRPO and RLOO that are trained on 60K labeled examples.
Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of
proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the
challenging SciCode benchmark, while being 3x more efficient than
self-consistency and sequential refinement. Our findings reveal that many
pretrained LLMs possess previously underappreciated reasoning capabilities that
can be effectively elicited through entropy minimization alone, without any
labeled data or even any parameter updates.
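
To make the first objective concrete, below is a minimal PyTorch sketch of a token-level entropy loss in the spirit of EM-FT. The function name, masking convention, and training-loop comments are illustrative assumptions, not the authors' released code; the only idea taken from the abstract is that the loss is the model's own per-token entropy on unlabeled, self-generated outputs.

```python
import torch
import torch.nn.functional as F

def token_entropy_loss(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    # logits:        (batch, seq_len, vocab) raw model outputs
    # response_mask: (batch, seq_len) floats, 1.0 on self-generated response tokens
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-position Shannon entropy: H_t = -sum_v p_t(v) * log p_t(v)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    # Mean entropy over response tokens only; minimizing it concentrates
    # probability mass on the model's most confident continuations.
    return (entropy * response_mask).sum() / response_mask.sum().clamp(min=1.0)

# Sketched EM-FT-style step (names hypothetical): sample unlabeled
# completions from the model itself, then descend on their entropy.
# logits = model(input_ids).logits
# loss = token_entropy_loss(logits, response_mask)
# loss.backward(); optimizer.step()
```

EM-RL reuses the same quantity as a reward signal: per the abstract, the negative of this entropy is the only reward the RL algorithm maximizes.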
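EM-INF needs no training at all. One simple way to realize "inference-time logit adjustment to reduce entropy" is a few gradient steps on the next-token entropy with respect to the logits themselves; the step count and learning rate below are illustrative assumptions, and this sketch is one possible reading of the abstract rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def adjust_logits_for_low_entropy(logits: torch.Tensor,
                                  steps: int = 3,
                                  lr: float = 0.1) -> torch.Tensor:
    # logits: (vocab_size,) raw logits at one decoding step.
    # Only this step's logits are optimized; model weights never change.
    z = logits.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        log_p = F.log_softmax(z, dim=-1)
        entropy = -(log_p.exp() * log_p).sum()  # H(softmax(z))
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return z.detach()

# Hypothetical decoding usage: sharpen the distribution, then sample.
# next_logits = model(input_ids).logits[0, -1]
# probs = F.softmax(adjust_logits_for_low_entropy(next_logits), dim=-1)
# next_token = torch.multinomial(probs, 1)
```

Because only the logits move, the extra cost is a handful of softmax/backward passes per token, consistent with the abstract's claim that EM-INF uses no training data and no parameter updates.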