Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

Abstract

Advancing the reasoning skills of large language models (LLMs) has attracted wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which suffer from limited scalability and high annotation costs. This motivates us to enhance LLM reasoning without external supervision. We introduce Genius, a generalizable and purely unsupervised self-training framework. Without external auxiliary signals, Genius must seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy that samples steps and estimates their value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function that mitigates estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step toward self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
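
The abstract names a stepwise foresight re-sampling strategy but gives no algorithmic details. The toy sketch below only illustrates the general idea of scoring candidate steps by simulated futures and then re-sampling among them. Every name in it (`foresight_values`, `resample_step`, `toy_simulate`), the mean-of-rollouts value estimate, and the softmax re-sampling rule are illustrative assumptions, not the Genius implementation.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def foresight_values(
    prefix: List[str],
    candidates: Sequence[str],
    simulate: Callable[[List[str]], float],
    num_rollouts: int = 4,
) -> List[float]:
    """Estimate each candidate step's value as the mean score of simulated futures.

    `simulate` is a hypothetical callback: given the steps so far, it rolls
    out a possible continuation and returns a scalar score for it.
    """
    values = []
    for step in candidates:
        scores = [simulate(prefix + [step]) for _ in range(num_rollouts)]
        values.append(sum(scores) / len(scores))
    return values


def resample_step(
    candidates: Sequence[str],
    values: Sequence[float],
    rng: random.Random,
    temperature: float = 1.0,
) -> str:
    """Re-sample one step with probability given by softmax over estimated values."""
    probs = softmax(values, temperature)
    return rng.choices(list(candidates), weights=probs, k=1)[0]


def toy_simulate(steps: List[str]) -> float:
    # Hypothetical stand-in for "roll out the future and score it".
    # In this toy, a shorter final step simply scores higher.
    return -float(len(steps[-1]))


if __name__ == "__main__":
    candidates = ["a", "bbb"]
    values = foresight_values(["question"], candidates, toy_simulate)
    print(values, resample_step(candidates, values, random.Random(0)))
```

In this sketch, a higher temperature keeps more exploration across candidate steps, while a lower one concentrates sampling on the step with the best estimated value. That is one simple way to trade off exploring and exploiting, again an assumption rather than the method described in the paper.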
