Cognitively Inspired Energy-Based World Models
June 13, 2024
Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Aman Chadha, Jundong Li, Tariq Iqbal
cs.AI
Abstract
One of the predominant methods for training world models is autoregressive
prediction in the output space of the next element of a sequence. In Natural
Language Processing (NLP), this takes the form of Large Language Models (LLMs)
predicting the next token; in Computer Vision (CV), this takes the form of
autoregressive models predicting the next frame/token/pixel. However, this
approach differs from human cognition in several respects. First, human
predictions about the future actively influence internal cognitive processes.
Second, humans naturally evaluate the plausibility of predictions regarding
future states. Third, building on this capability, humans allocate a dynamic
amount of time to a prediction by assessing when it is sufficient. This
adaptive process is analogous to System 2 thinking in
psychology. All these capabilities are fundamental to the success of humans at
high-level reasoning and planning. Therefore, to address the limitations of
traditional autoregressive models lacking these human-like capabilities, we
introduce Energy-Based World Models (EBWM). EBWM involves training an
Energy-Based Model (EBM) to predict the compatibility of a given context and a
predicted future state. In doing so, EBWM enables models to achieve all three
facets of human cognition described. Moreover, we develop a variant of the
traditional autoregressive transformer tailored for Energy-Based Models, termed
the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales
better with data and GPU Hours than traditional autoregressive transformers in
CV, and that EBWM offers promising early scaling in NLP. Consequently, this
approach offers an exciting path toward training future models capable of
System 2 thinking and intelligently searching across state spaces.
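To make the core idea more concrete, below is a minimal PyTorch sketch in the spirit of EBWM: a network assigns a scalar energy to a (context, candidate future) pair, and a prediction is refined by gradient descent on that energy, stopping once the model judges it plausible enough. This is an illustrative toy, not the paper's EBT architecture; the names, dimensions, optimizer, and stopping threshold (ToyEnergyWorldModel, predict_future, energy_threshold) are assumptions made for the example.

```python
# Conceptual sketch of an energy-based world model (not the paper's EBT).
# An energy function scores the compatibility of a context with a candidate
# future state; a prediction is refined by minimizing that energy for a
# dynamic number of steps. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class ToyEnergyWorldModel(nn.Module):
    """Maps (context, candidate future) to a scalar energy; lower = more compatible."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, 256),
            nn.GELU(),
            nn.Linear(256, 1),
        )

    def forward(self, context: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
        return self.score(torch.cat([context, future], dim=-1)).squeeze(-1)


def predict_future(model, context, steps=32, lr=0.1, energy_threshold=0.05):
    """Refine a candidate future state by gradient descent on its energy.

    The loop stops early once the energy is low enough, so the compute spent
    on a prediction adapts to its difficulty, a rough analogue of the
    System 2-style dynamic allocation described in the abstract.
    """
    future = torch.randn_like(context, requires_grad=True)  # initial guess
    optimizer = torch.optim.SGD([future], lr=lr)
    for _ in range(steps):
        energy = model(context, future).mean()
        if energy.item() < energy_threshold:
            break  # the model deems its own prediction plausible enough
        optimizer.zero_grad()
        energy.backward()
        optimizer.step()
    return future.detach(), energy.item()


if __name__ == "__main__":
    model = ToyEnergyWorldModel(dim=64)
    context = torch.randn(8, 64)  # a batch of 8 context embeddings
    future, final_energy = predict_future(model, context)
    print(future.shape, final_energy)
```

Note that, unlike an autoregressive model that emits a prediction in a single forward pass, this formulation evaluates and iteratively improves a candidate future, which is what lets prediction, plausibility assessment, and dynamic compute allocation share one mechanism.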