Cognitively Inspired Energy-Based World Models
June 13, 2024
Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Aman Chadha, Jundong Li, Tariq Iqbal
cs.AI
Abstract
One of the predominant methods for training world models is autoregressive
prediction in the output space of the next element of a sequence. In Natural
Language Processing (NLP), this takes the form of Large Language Models (LLMs)
predicting the next token; in Computer Vision (CV), this takes the form of
autoregressive models predicting the next frame/token/pixel. However, this
approach differs from human cognition in several respects. First, human
predictions about the future actively influence internal cognitive processes.
Second, humans naturally evaluate the plausibility of predictions regarding
future states. Based on this capability, and third, by assessing when
predictions are sufficient, humans allocate a dynamic amount of time to make a
prediction. This adaptive process is analogous to System 2 thinking in
psychology. All these capabilities are fundamental to the success of humans at
high-level reasoning and planning. Therefore, to address the limitations of
traditional autoregressive models lacking these human-like capabilities, we
introduce Energy-Based World Models (EBWM). EBWM involves training an
Energy-Based Model (EBM) to predict the compatibility of a given context and a
predicted future state. In doing so, EBWM enables models to achieve all three
facets of human cognition described. Moreover, we developed a variant of the
traditional autoregressive transformer tailored for Energy-Based models, termed
the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales
better with data and GPU Hours than traditional autoregressive transformers in
CV, and that EBWM offers promising early scaling in NLP. Consequently, this
approach offers an exciting path toward training future models capable of
System 2 thinking and intelligently searching across state spaces.
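
To make the core idea concrete, the following is a minimal PyTorch-style sketch, not the paper's Energy-Based Transformer (EBT): the MLP energy head, dimensions, learning rate, step budget, and stopping threshold are illustrative assumptions. It shows a learned energy function that scores the compatibility of a context with a candidate future state, and an inference loop that refines the candidate by gradient descent on that energy, stopping once the prediction is judged sufficient. This adaptive stopping is where the dynamic, System 2-style allocation of compute comes from.

# Sketch of the EBWM idea: an energy function E(context, candidate_future)
# plus gradient-based refinement of the candidate with an adaptive stopping rule.
# Architecture and hyperparameters are illustrative assumptions, not the paper's EBT.
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    """Scores how compatible a predicted future state is with the context."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.SiLU(),
            nn.Linear(128, 1),  # scalar energy: lower = more compatible
        )

    def forward(self, context: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([context, future], dim=-1)).squeeze(-1)

@torch.enable_grad()
def predict_future(model: EnergyModel, context: torch.Tensor,
                   max_steps: int = 50, lr: float = 0.1,
                   energy_threshold: float = 0.05) -> torch.Tensor:
    """Refine an initial guess by descending the energy surface.

    The number of refinement steps is dynamic: the loop exits early once the
    prediction is deemed compatible enough with the context.
    """
    future = torch.randn_like(context, requires_grad=True)  # initial guess
    for _ in range(max_steps):
        energy = model(context, future).mean()
        if energy.item() < energy_threshold:   # prediction judged sufficient
            break
        grad, = torch.autograd.grad(energy, future)  # how the prediction should change
        with torch.no_grad():
            future -= lr * grad                 # move the prediction, not the weights
    return future.detach()

if __name__ == "__main__":
    model = EnergyModel(dim=64)
    context = torch.randn(4, 64)                # batch of context embeddings
    prediction = predict_future(model, context)
    print(prediction.shape)                     # torch.Size([4, 64])

In this sketch the prediction itself is the variable being optimized at inference time, which mirrors the abstract's first point (predictions feeding back into internal processing); the scalar energy plays the role of the plausibility judgment; and the early-exit condition stands in for deciding when a prediction is good enough. Training the energy function is not shown and would follow the procedure described in the paper.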