

Energy-Based Transformers are Scalable Learners and Thinkers

July 2, 2025
作者: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal
cs.AI

Abstract

Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
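To make the "prediction as optimization against a learned verifier" idea concrete, the following is a minimal PyTorch sketch of the inference loop the abstract describes: an energy function scores an (input, candidate-prediction) pair, and the prediction is refined by gradient descent on that energy until it (approximately) converges. The `EnergyModel` module, the `predict_by_energy_minimization` helper, and all hyperparameters (step count, step size, dimensions) are illustrative assumptions for this sketch, not the paper's actual EBT architecture or training recipe.

```python
# Minimal sketch of prediction via energy minimization, as described in the
# abstract. The EnergyModel below is a toy MLP stand-in; the paper's EBTs use
# a Transformer to assign the energy.
import torch
import torch.nn as nn


class EnergyModel(nn.Module):
    """Maps an (input, candidate prediction) pair to a scalar energy."""

    def __init__(self, input_dim: int, pred_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim + pred_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
        # Lower energy = input and candidate prediction are more compatible.
        return self.net(torch.cat([x, y_hat], dim=-1)).squeeze(-1)


def predict_by_energy_minimization(
    model: EnergyModel,
    x: torch.Tensor,
    pred_dim: int,
    steps: int = 20,   # illustrative; more steps = more "System 2" compute
    lr: float = 0.1,   # illustrative step size
) -> torch.Tensor:
    """Refine a randomly initialized candidate by gradient descent on the energy."""
    y_hat = torch.randn(x.shape[0], pred_dim, requires_grad=True)
    for _ in range(steps):
        energy = model(x, y_hat).sum()
        (grad,) = torch.autograd.grad(energy, y_hat)
        y_hat = (y_hat - lr * grad).detach().requires_grad_(True)
    return y_hat.detach()


if __name__ == "__main__":
    model = EnergyModel(input_dim=32, pred_dim=8)
    x = torch.randn(4, 32)
    y_pred = predict_by_energy_minimization(model, x, pred_dim=8)
    print(y_pred.shape)  # torch.Size([4, 8])
```

The key design point this sketch illustrates is that inference-time "thinking" becomes a tunable amount of optimization: spending more gradient steps on a hard prediction is the EBM analogue of allocating more System 2 computation, without requiring any extra verifier or reward model beyond the unsupervised-trained energy function itself.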