One-shot Entropy Minimization

Abstract

We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements comparable to, or even greater than, those obtained with rule-based reinforcement learning using thousands of examples and carefully designed rewards. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at https://github.com/zitian-gao/one-shot-em.
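To make the objective concrete, here is a minimal toy sketch of entropy minimization on a single probability distribution. It is not the authors' implementation: it assumes the objective is the Shannon entropy of a softmax distribution, and it optimizes a bare logit vector directly, whereas in a language model the gradient would presumably flow into the model weights across generated token positions. The function names, the learning rate, and the example logits are all hypothetical; only the step count of 10 comes from the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad(logits):
    """Gradient of H(softmax(z)) with respect to z.

    For p = softmax(z): dH/dz_i = -p_i * (log p_i + H).
    """
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


def minimize_entropy(logits, steps=10, lr=1.0):
    """Run plain gradient descent on the entropy (no labels, no reward).

    Returns the final logits and the entropy before each step plus at the end.
    lr is an illustrative assumption, not a value from the paper.
    """
    z = list(logits)
    history = [entropy(softmax(z))]
    for _ in range(steps):
        g = entropy_grad(z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        history.append(entropy(softmax(z)))
    return z, history


if __name__ == "__main__":
    start = [2.0, 1.0, 0.5, 0.0]  # hypothetical logits over 4 tokens
    final, hist = minimize_entropy(start, steps=10)
    print("entropy before:", round(hist[0], 4))
    print("entropy after: ", round(hist[-1], 4))
```

In this toy setting, the update pushes probability mass toward the token that is already most likely, so the distribution sharpens without any label or reward signal. That is the sense in which the objective is unsupervised.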
