Energy-Based Transformers are Scalable Learners and Thinkers

Abstract

Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performance. However, most existing approaches suffer from at least one of several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., limited to verifiable domains like math and coding), or require additional supervision or training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate predictions, and then reframing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs), a new class of Energy-Based Models (EBMs), to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving up to a 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
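
The idea of predicting by minimizing an energy over candidate predictions can be illustrated with a minimal sketch. This is not the authors' implementation: in an EBT the energy would be a trained Transformer and the gradient would come from automatic differentiation. Here `toy_energy`, `toy_energy_grad`, `minimize_energy`, the step size, and the tolerance are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
from typing import Callable, Tuple


def toy_energy(x: float, y: float) -> float:
    """Hypothetical stand-in for a learned energy: low when y is compatible with x."""
    return 0.5 * (y - 2.0 * x) ** 2


def toy_energy_grad(x: float, y: float) -> float:
    """Analytic gradient of toy_energy with respect to the candidate y."""
    return y - 2.0 * x


def minimize_energy(
    x: float,
    y_init: float,
    grad_fn: Callable[[float, float], float],
    step_size: float = 0.5,   # assumed value, not taken from the paper
    tol: float = 1e-8,        # assumed convergence threshold
    max_steps: int = 1000,
) -> Tuple[float, int]:
    """Refine a candidate prediction by gradient descent on the energy.

    Returns the refined candidate and the number of update steps taken.
    """
    y = y_init
    for step in range(max_steps):
        g = grad_fn(x, y)
        if abs(g) < tol:  # converged: the energy is locally flat in y
            return y, step
        y -= step_size * g  # move the candidate toward lower energy
    return y, max_steps


# Start from an arbitrary candidate and let energy minimization refine it.
y_hat, steps = minimize_energy(x=3.0, y_init=0.0, grad_fn=toy_energy_grad)
```

In this toy setting the number of descent steps plays the role of inference-time compute: a looser tolerance or a smaller `max_steps` stops earlier and returns a rougher candidate, while running to convergence returns the energy minimizer.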


