Energy-Based Transformers are Scalable Learners and Thinkers

Abstract

Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving up to a 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
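
The prediction procedure named in the abstract (assign an energy to each input and candidate-prediction pair, then refine the candidate by gradient descent until convergence) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the quadratic `energy` function is a hand-written stand-in for the learned EBT energy function, and `predict`, `step_size`, `tol`, and `max_steps` are hypothetical names that do not come from the paper.

```python
# Toy illustration only: a hand-written quadratic energy stands in for the
# learned energy function of an EBT. Nothing here is the authors' code.


def energy(x: float, y: float) -> float:
    """Hypothetical energy: low when candidate y is compatible with input x."""
    return 0.5 * (y - 2.0 * x) ** 2


def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to the candidate y."""
    return y - 2.0 * x


def predict(x: float, y_init: float = 0.0, step_size: float = 0.5,
            tol: float = 1e-8, max_steps: int = 1000) -> tuple[float, int]:
    """Refine a candidate prediction by gradient descent on the energy.

    Stops when the gradient magnitude falls below `tol` or after
    `max_steps` updates, and returns the final candidate together with
    the number of updates performed.
    """
    y = y_init
    steps = 0
    while steps < max_steps and abs(energy_grad_y(x, y)) >= tol:
        y -= step_size * energy_grad_y(x, y)  # move toward lower energy
        steps += 1
    return y, steps


if __name__ == "__main__":
    y_hat, n_steps = predict(3.0)
    print(f"prediction={y_hat:.6f}, updates={n_steps}, "
          f"energy={energy(3.0, y_hat):.3e}")
```

In this sketch the number of updates plays the role of inference-time compute: a looser `tol` or a smaller `max_steps` returns a rougher prediction sooner, while a tighter tolerance spends more steps on refining the candidate. In an actual EBT the gradient would presumably come from automatic differentiation through the model rather than from a closed-form expression.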
