The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
May 21, 2025
Authors: Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
cs.AI
Abstract
Entropy minimization (EM) trains the model to concentrate even more
probability mass on its most confident outputs. We show that this simple
objective alone, without any labeled data, can substantially improve large
language models' (LLMs) performance on challenging math, physics, and coding
tasks. We explore three approaches: (1) EM-FT: token-level entropy minimization, analogous to instruction finetuning but applied to unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce
entropy without any training data or parameter updates. On Qwen-7B, EM-RL,
without any labeled data, achieves comparable or better performance than strong
RL baselines such as GRPO and RLOO that are trained on 60K labeled examples.
Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of
proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the
challenging SciCode benchmark, while being 3x more efficient than
self-consistency and sequential refinement. Our findings reveal that many
pretrained LLMs possess previously underappreciated reasoning capabilities that
can be effectively elicited through entropy minimization alone, without any
labeled data or even any parameter updates.
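
To make the first objective concrete, below is a minimal PyTorch sketch of a token-level entropy loss in the spirit of EM-FT. The function name, masking convention, and training-loop comments are illustrative assumptions, not the authors' released code; the only idea taken from the abstract is that the loss is the model's own per-token entropy on unlabeled, self-generated outputs.

```python
import torch
import torch.nn.functional as F

def token_entropy_loss(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    # logits:        (batch, seq_len, vocab) raw model outputs
    # response_mask: (batch, seq_len) floats, 1.0 on self-generated response tokens
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-position Shannon entropy: H_t = -sum_v p_t(v) * log p_t(v)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    # Mean entropy over response tokens only; minimizing it concentrates
    # probability mass on the model's most confident continuations.
    return (entropy * response_mask).sum() / response_mask.sum().clamp(min=1.0)

# Sketched EM-FT-style step (names hypothetical): sample unlabeled
# completions from the model itself, then descend on their entropy.
# logits = model(input_ids).logits
# loss = token_entropy_loss(logits, response_mask)
# loss.backward(); optimizer.step()
```

EM-RL reuses the same quantity as a reward signal: per the abstract, the negative of this entropy is the only reward the RL algorithm maximizes.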
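EM-INF needs no training at all. One simple way to realize "inference-time logit adjustment to reduce entropy" is a few gradient steps on the next-token entropy with respect to the logits themselves; the step count and learning rate below are illustrative assumptions, and this sketch is one possible reading of the abstract rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def adjust_logits_for_low_entropy(logits: torch.Tensor,
                                  steps: int = 3,
                                  lr: float = 0.1) -> torch.Tensor:
    # logits: (vocab_size,) raw logits at one decoding step.
    # Only this step's logits are optimized; model weights never change.
    z = logits.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        log_p = F.log_softmax(z, dim=-1)
        entropy = -(log_p.exp() * log_p).sum()  # H(softmax(z))
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return z.detach()

# Hypothetical decoding usage: sharpen the distribution, then sample.
# next_logits = model(input_ids).logits[0, -1]
# probs = F.softmax(adjust_logits_for_low_entropy(next_logits), dim=-1)
# next_token = torch.multinomial(probs, 1)
```

Because only the logits move, the extra cost is a handful of softmax/backward passes per token, consistent with the abstract's claim that EM-INF uses no training data and no parameter updates.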