The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Abstract

Entropy minimization (EM) trains a model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve the performance of large language models (LLMs) on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy in a manner similar to instruction finetuning, but on unlabeled outputs drawn from the model itself; (2) EM-RL uses reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF adjusts logits at inference time to reduce entropy, without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models such as GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
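
To make the quantity being minimized concrete, the following is a minimal sketch of token-level entropy and of one possible form of inference-time logit adjustment. The abstract does not specify the adjustment procedure, so the gradient steps on the logits, the function names (softmax, entropy, reduce_entropy), and the values of steps and lr are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reduce_entropy(logits, steps=10, lr=0.5):
    """Hypothetical inference-time adjustment: gradient descent on the logits
    to lower the entropy of softmax(logits). No model parameters are touched.
    steps and lr are illustrative values, not taken from the paper."""
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        h = entropy(p)
        # Analytic gradient of the entropy w.r.t. each logit:
        # dH/dz_i = -p_i * (log p_i + H)
        grad = [-(pi * (math.log(pi) + h)) if pi > 0.0 else 0.0 for pi in p]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


# Example: logits for a four-token vocabulary at a single decoding step.
logits = [2.0, 1.5, 0.5, 0.1]
adjusted = reduce_entropy(logits)
print("entropy before:", entropy(softmax(logits)))
print("entropy after: ", entropy(softmax(adjusted)))
```

In this sketch the adjustment sharpens the distribution around the token the model already prefers, which is the sense in which entropy minimization places more probability mass on the most confident outputs. Under the same assumptions, averaging entropy over the tokens of sampled outputs would give a training loss in the spirit of EM-FT, and the negative of that average would serve as a reward in the spirit of EM-RL.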
