Energy-Based Transformers are Scalable Learners and Thinkers
July 2, 2025
Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal
cs.AI
Abstract
Inference-time computation techniques, analogous to human System 2 Thinking,
have recently become popular for improving model performances. However, most
existing approaches suffer from several limitations: they are modality-specific
(e.g., working only in text), problem-specific (e.g., verifiable domains like
math and coding), or require additional supervision/training on top of
unsupervised pretraining (e.g., verifiers or verifiable rewards). In this
paper, we ask the question "Is it possible to generalize these System 2
Thinking approaches, and develop models that learn to think solely from
unsupervised learning?" Interestingly, we find the answer is yes, by learning
to explicitly verify the compatibility between inputs and
candidate-predictions, and then re-framing prediction problems as optimization
with respect to this verifier. Specifically, we train Energy-Based Transformers
(EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy
value to every input and candidate-prediction pair, enabling predictions
through gradient descent-based energy minimization until convergence. Across
both discrete (text) and continuous (visual) modalities, we find EBTs scale
faster than the dominant Transformer++ approach during training, achieving an
up to 35% higher scaling rate with respect to data, batch size, parameters,
FLOPs, and depth. During inference, EBTs improve performance with System 2
Thinking by 29% more than the Transformer++ on language tasks, and EBTs
outperform Diffusion Transformers on image denoising while using fewer forward
passes. Further, we find that EBTs achieve better results than existing models
on most downstream tasks given the same or worse pretraining performance,
suggesting that EBTs generalize better than existing approaches. Consequently,
EBTs are a promising new paradigm for scaling both the learning and thinking
capabilities of models.
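
To make the "prediction as optimization" idea concrete, below is a minimal PyTorch sketch of the inference procedure the abstract describes: a learned verifier assigns a scalar energy to each (input, candidate-prediction) pair, and the prediction is obtained by gradient descent on that energy with respect to the candidate until approximate convergence. The class and argument names (ToyEnergyModel, n_steps, step_size) are illustrative assumptions, not the paper's actual architecture or API.

# Minimal sketch (not the authors' code) of prediction via energy minimization.
# A learned energy model scores (input, candidate) compatibility; lower = better.

import torch
import torch.nn as nn


class ToyEnergyModel(nn.Module):
    """Maps an (input, candidate-prediction) pair to a scalar energy."""

    def __init__(self, input_dim: int, pred_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim + pred_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Concatenate input and candidate, return one energy value per pair.
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)


def predict_by_energy_minimization(
    model: nn.Module,
    x: torch.Tensor,
    pred_dim: int,
    n_steps: int = 20,
    step_size: float = 0.1,
) -> torch.Tensor:
    """Refine a random candidate by gradient descent on its energy."""
    y = torch.randn(x.shape[0], pred_dim, requires_grad=True)
    for _ in range(n_steps):  # more steps = more inference-time compute ("thinking")
        energy = model(x, y).sum()
        (grad,) = torch.autograd.grad(energy, y)
        y = (y - step_size * grad).detach().requires_grad_(True)
    return y.detach()


if __name__ == "__main__":
    model = ToyEnergyModel(input_dim=16, pred_dim=8)
    x = torch.randn(4, 16)
    y_hat = predict_by_energy_minimization(model, x, pred_dim=8)
    print(y_hat.shape)  # torch.Size([4, 8])

In this reading, the number of refinement steps is the knob behind the "System 2 Thinking" claim: spending more gradient steps at inference time buys a lower-energy, and ideally more accurate, prediction without any additional supervision beyond the pretrained verifier.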