Cognitively Inspired Energy-Based World Models
June 13, 2024
Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Aman Chadha, Jundong Li, Tariq Iqbal
cs.AI
Abstract
One of the predominant methods for training world models is autoregressive
prediction in the output space of the next element of a sequence. In Natural
Language Processing (NLP), this takes the form of Large Language Models (LLMs)
predicting the next token; in Computer Vision (CV), this takes the form of
autoregressive models predicting the next frame/token/pixel. However, this
approach differs from human cognition in several respects. First, human
predictions about the future actively influence internal cognitive processes.
Second, humans naturally evaluate the plausibility of predictions regarding
future states. Based on this capability, and third, by assessing when
predictions are sufficient, humans allocate a dynamic amount of time to make a
prediction. This adaptive process is analogous to System 2 thinking in
psychology. All these capabilities are fundamental to the success of humans at
high-level reasoning and planning. Therefore, to address the limitations of
traditional autoregressive models lacking these human-like capabilities, we
introduce Energy-Based World Models (EBWM). EBWM involves training an
Energy-Based Model (EBM) to predict the compatibility of a given context and a
predicted future state. In doing so, EBWM enables models to achieve all three
facets of human cognition described. Moreover, we developed a variant of the
traditional autoregressive transformer tailored for Energy-Based models, termed
the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales
better with data and GPU Hours than traditional autoregressive transformers in
CV, and that EBWM offers promising early scaling in NLP. Consequently, this
approach offers an exciting path toward training future models capable of
System 2 thinking and intelligently searching across state spaces.
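
To make the core idea concrete, the following is a minimal PyTorch-style sketch, not the paper's Energy-Based Transformer (EBT): the MLP energy head, dimensions, learning rate, step budget, and stopping threshold are illustrative assumptions. It shows a learned energy function that scores the compatibility of a context with a candidate future state, and an inference loop that refines the candidate by gradient descent on that energy, stopping once the prediction is judged sufficient. This adaptive stopping is where the dynamic, System 2-style allocation of compute comes from.

# Sketch of the EBWM idea: an energy function E(context, candidate_future)
# plus gradient-based refinement of the candidate with an adaptive stopping rule.
# Architecture and hyperparameters are illustrative assumptions, not the paper's EBT.
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    """Scores how compatible a predicted future state is with the context."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.SiLU(),
            nn.Linear(128, 1),  # scalar energy: lower = more compatible
        )

    def forward(self, context: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([context, future], dim=-1)).squeeze(-1)

@torch.enable_grad()
def predict_future(model: EnergyModel, context: torch.Tensor,
                   max_steps: int = 50, lr: float = 0.1,
                   energy_threshold: float = 0.05) -> torch.Tensor:
    """Refine an initial guess by descending the energy surface.

    The number of refinement steps is dynamic: the loop exits early once the
    prediction is deemed compatible enough with the context.
    """
    future = torch.randn_like(context, requires_grad=True)  # initial guess
    for _ in range(max_steps):
        energy = model(context, future).mean()
        if energy.item() < energy_threshold:   # prediction judged sufficient
            break
        grad, = torch.autograd.grad(energy, future)  # how the prediction should change
        with torch.no_grad():
            future -= lr * grad                 # move the prediction, not the weights
    return future.detach()

if __name__ == "__main__":
    model = EnergyModel(dim=64)
    context = torch.randn(4, 64)                # batch of context embeddings
    prediction = predict_future(model, context)
    print(prediction.shape)                     # torch.Size([4, 64])

In this sketch the prediction itself is the variable being optimized at inference time, which mirrors the abstract's first point (predictions feeding back into internal processing); the scalar energy plays the role of the plausibility judgment; and the early-exit condition stands in for deciding when a prediction is good enough. Training the energy function is not shown and would follow the procedure described in the paper.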