Cognitively Inspired Energy-Based World Models
June 13, 2024
Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Aman Chadha, Jundong Li, Tariq Iqbal
cs.AI
Abstract
One of the predominant methods for training world models is autoregressive
prediction in the output space of the next element of a sequence. In Natural
Language Processing (NLP), this takes the form of Large Language Models (LLMs)
predicting the next token; in Computer Vision (CV), this takes the form of
autoregressive models predicting the next frame/token/pixel. However, this
approach differs from human cognition in several respects. First, human
predictions about the future actively influence internal cognitive processes.
Second, humans naturally evaluate the plausibility of predictions regarding
future states. Third, building on this capability, humans allocate a dynamic
amount of time to a prediction by assessing when it is sufficient. This
adaptive process is analogous to System 2 thinking in
psychology. All these capabilities are fundamental to the success of humans at
high-level reasoning and planning. Therefore, to address the limitations of
traditional autoregressive models lacking these human-like capabilities, we
introduce Energy-Based World Models (EBWM). EBWM involves training an
Energy-Based Model (EBM) to predict the compatibility of a given context and a
predicted future state. In doing so, EBWM enables models to achieve all three
facets of human cognition described. Moreover, we develop a variant of the
traditional autoregressive transformer tailored for Energy-Based Models, termed
the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales
better with data and GPU Hours than traditional autoregressive transformers in
CV, and that EBWM offers promising early scaling in NLP. Consequently, this
approach offers an exciting path toward training future models capable of
System 2 thinking and intelligently searching across state spaces.
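To make the core idea more concrete, below is a minimal PyTorch sketch in the spirit of EBWM: a network assigns a scalar energy to a (context, candidate future) pair, and a prediction is refined by gradient descent on that energy, stopping once the model judges it plausible enough. This is an illustrative toy, not the paper's EBT architecture; the names, dimensions, optimizer, and stopping threshold (ToyEnergyWorldModel, predict_future, energy_threshold) are assumptions made for the example.

```python
# Conceptual sketch of an energy-based world model (not the paper's EBT).
# An energy function scores the compatibility of a context with a candidate
# future state; a prediction is refined by minimizing that energy for a
# dynamic number of steps. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class ToyEnergyWorldModel(nn.Module):
    """Maps (context, candidate future) to a scalar energy; lower = more compatible."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, 256),
            nn.GELU(),
            nn.Linear(256, 1),
        )

    def forward(self, context: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
        return self.score(torch.cat([context, future], dim=-1)).squeeze(-1)


def predict_future(model, context, steps=32, lr=0.1, energy_threshold=0.05):
    """Refine a candidate future state by gradient descent on its energy.

    The loop stops early once the energy is low enough, so the compute spent
    on a prediction adapts to its difficulty, a rough analogue of the
    System 2-style dynamic allocation described in the abstract.
    """
    future = torch.randn_like(context, requires_grad=True)  # initial guess
    optimizer = torch.optim.SGD([future], lr=lr)
    for _ in range(steps):
        energy = model(context, future).mean()
        if energy.item() < energy_threshold:
            break  # the model deems its own prediction plausible enough
        optimizer.zero_grad()
        energy.backward()
        optimizer.step()
    return future.detach(), energy.item()


if __name__ == "__main__":
    model = ToyEnergyWorldModel(dim=64)
    context = torch.randn(8, 64)  # a batch of 8 context embeddings
    future, final_energy = predict_future(model, context)
    print(future.shape, final_energy)
```

Note that, unlike an autoregressive model that emits a prediction in a single forward pass, this formulation evaluates and iteratively improves a candidate future, which is what lets prediction, plausibility assessment, and dynamic compute allocation share one mechanism.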