One-shot Entropy Minimization
May 26, 2025
Authors: Zitian Gao, Lynx Chen, Joey Zhou, Bryan Dai
cs.AI
Abstract
We trained 13,440 large language models and found that entropy minimization
requires only a single unlabeled example and 10 optimization steps to achieve
performance improvements comparable to, or even greater than, those obtained
using thousands of examples and carefully designed rewards in rule-based
reinforcement learning. This striking result may prompt a rethinking of
post-training paradigms for large language models. Our code is available at
https://github.com/zitian-gao/one-shot-em.
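
The abstract does not spell out implementation details, but the core idea of entropy minimization as a post-training objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact method: the model name, prompt, learning rate, and sampling settings are placeholders, and only the abstract's "single unlabeled example, 10 optimization steps" framing is taken from the paper (see the released code for the real implementation).

```python
# Minimal sketch of one-shot entropy minimization (EM) post-training.
# Assumptions (not from the paper): model name, prompt, learning rate,
# and sampling settings are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # hypothetical choice for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A single unlabeled prompt: no reference answer or reward is needed.
prompt = "Solve: what is 17 * 24?"  # illustrative placeholder
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

for step in range(10):  # the abstract reports only 10 optimization steps
    # Sample a continuation from the current policy. Generation is only
    # used to obtain tokens, so no gradients are tracked here.
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=1.0,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Re-run a forward pass on the sampled sequence to get logits with grad.
    logits = model(generated).logits  # (1, seq_len, vocab)
    # Logits at position i predict token i+1, so the distributions over the
    # generated tokens live at positions [prompt_len - 1, seq_len - 1).
    gen_logits = logits[:, prompt_len - 1 : -1, :]
    log_probs = F.log_softmax(gen_logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # mean token entropy
    # The loss is the entropy itself: minimizing it sharpens the model's
    # own predictive distribution, using no labels and no reward signal.
    entropy.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: mean token entropy = {entropy.item():.4f}")
```

In contrast to rule-based reinforcement learning, no verifier or reward function appears anywhere in this loop; the only training signal is the model's own predictive uncertainty on its sampled outputs.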