One-shot Entropy Minimization

Abstract

We trained 13,440 large language models and found that entropy minimization, using only a single unlabeled example and 10 optimization steps, achieves performance improvements comparable to or even greater than those obtained with thousands of examples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at https://github.com/zitian-gao/one-shot-em.
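To make the idea of entropy minimization concrete, here is a minimal toy sketch. It is not the authors' implementation: it assumes the objective is the Shannon entropy of a softmax distribution over next-token logits, and it runs plain gradient descent directly on a single hypothetical logit vector rather than on model parameters. The function names, the learning rate, and the example logits are all illustrative assumptions; only the count of 10 steps is taken from the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad(logits):
    """Analytic gradient of the entropy with respect to the logits.

    For H = -sum_i p_i log p_i with p = softmax(z):
        dH/dz_j = -p_j * (log p_j + H)
    """
    probs = softmax(logits)
    h = token_entropy(logits)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]


def minimize_entropy(logits, steps=10, lr=1.0):
    """Run `steps` gradient-descent updates that lower the entropy.

    Toy assumption: the logits themselves are the trainable parameters.
    Returns the final logits and the entropy after each step.
    """
    z = list(logits)
    history = [token_entropy(z)]
    for _ in range(steps):
        g = entropy_grad(z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        history.append(token_entropy(z))
    return z, history


# Hypothetical starting logits for a three-token vocabulary.
final_logits, history = minimize_entropy([1.0, 0.5, 0.0])
print(f"entropy before: {history[0]:.4f}, after 10 steps: {history[-1]:.4f}")
```

In this sketch the updates push probability mass toward the token that was already most likely, which is the sense in which the objective needs no labels: it only sharpens the distribution the model already produces.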
