Energy-Based Transformers are Scalable Learners and Thinkers
July 2, 2025
Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal
cs.AI
Abstract
Inference-time computation techniques, analogous to human System 2 Thinking,
have recently become popular for improving model performances. However, most
existing approaches suffer from several limitations: they are modality-specific
(e.g., working only in text), problem-specific (e.g., verifiable domains like
math and coding), or require additional supervision/training on top of
unsupervised pretraining (e.g., verifiers or verifiable rewards). In this
paper, we ask the question "Is it possible to generalize these System 2
Thinking approaches, and develop models that learn to think solely from
unsupervised learning?" Interestingly, we find the answer is yes, by learning
to explicitly verify the compatibility between inputs and
candidate-predictions, and then re-framing prediction problems as optimization
with respect to this verifier. Specifically, we train Energy-Based Transformers
(EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy
value to every input and candidate-prediction pair, enabling predictions
through gradient descent-based energy minimization until convergence. Across
both discrete (text) and continuous (visual) modalities, we find EBTs scale
faster than the dominant Transformer++ approach during training, achieving an
up to 35% higher scaling rate with respect to data, batch size, parameters,
FLOPs, and depth. During inference, EBTs improve performance with System 2
Thinking by 29% more than the Transformer++ on language tasks, and EBTs
outperform Diffusion Transformers on image denoising while using fewer forward
passes. Further, we find that EBTs achieve better results than existing models
on most downstream tasks given the same or worse pretraining performance,
suggesting that EBTs generalize better than existing approaches. Consequently,
EBTs are a promising new paradigm for scaling both the learning and thinking
capabilities of models.
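
To make the "prediction as optimization" idea concrete, below is a minimal PyTorch sketch of the inference procedure the abstract describes: a learned verifier assigns a scalar energy to each (input, candidate-prediction) pair, and the prediction is obtained by gradient descent on that energy with respect to the candidate until approximate convergence. The class and argument names (ToyEnergyModel, n_steps, step_size) are illustrative assumptions, not the paper's actual architecture or API.

# Minimal sketch (not the authors' code) of prediction via energy minimization.
# A learned energy model scores (input, candidate) compatibility; lower = better.

import torch
import torch.nn as nn


class ToyEnergyModel(nn.Module):
    """Maps an (input, candidate-prediction) pair to a scalar energy."""

    def __init__(self, input_dim: int, pred_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim + pred_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Concatenate input and candidate, return one energy value per pair.
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)


def predict_by_energy_minimization(
    model: nn.Module,
    x: torch.Tensor,
    pred_dim: int,
    n_steps: int = 20,
    step_size: float = 0.1,
) -> torch.Tensor:
    """Refine a random candidate by gradient descent on its energy."""
    y = torch.randn(x.shape[0], pred_dim, requires_grad=True)
    for _ in range(n_steps):  # more steps = more inference-time compute ("thinking")
        energy = model(x, y).sum()
        (grad,) = torch.autograd.grad(energy, y)
        y = (y - step_size * grad).detach().requires_grad_(True)
    return y.detach()


if __name__ == "__main__":
    model = ToyEnergyModel(input_dim=16, pred_dim=8)
    x = torch.randn(4, 16)
    y_hat = predict_by_energy_minimization(model, x, pred_dim=8)
    print(y_hat.shape)  # torch.Size([4, 8])

In this reading, the number of refinement steps is the knob behind the "System 2 Thinking" claim: spending more gradient steps at inference time buys a lower-energy, and ideally more accurate, prediction without any additional supervision beyond the pretrained verifier.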