Cognitively Inspired Energy-Based World Models

Abstract

One of the predominant methods for training world models is autoregressive prediction, in the output space, of the next element of a sequence. In Natural Language Processing (NLP), this takes the form of Large Language Models (LLMs) predicting the next token; in Computer Vision (CV), it takes the form of autoregressive models predicting the next frame, token, or pixel. However, this approach differs from human cognition in several respects. First, human predictions about the future actively influence internal cognitive processes. Second, humans naturally evaluate the plausibility of predictions about future states. Third, building on this capability, humans assess when a prediction is sufficient and so allocate a dynamic amount of time to making it. This adaptive process is analogous to System 2 thinking in psychology. All of these capabilities are fundamental to human success at high-level reasoning and planning. Therefore, to address the limitations of traditional autoregressive models, which lack these human-like capabilities, we introduce Energy-Based World Models (EBWM). EBWM involves training an Energy-Based Model (EBM) to predict the compatibility of a given context and a predicted future state. In doing so, EBWM enables models to achieve all three facets of human cognition described above. Moreover, we developed a variant of the traditional autoregressive transformer tailored for Energy-Based Models, termed the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales better with data and GPU hours than traditional autoregressive transformers in CV, and that EBWM shows promising early scaling in NLP. Consequently, this approach offers an exciting path toward training future models capable of System 2 thinking and of intelligently searching across state spaces.
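To make the idea of scoring context/prediction compatibility more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hand-written quadratic energy function (`toy_energy`, hypothetical) in place of a learned EBM or EBT, and it assumes that a candidate future state is refined by gradient descent on the energy until the energy falls below a threshold. The refinement rule, the threshold, and all names are illustrative assumptions; the abstract does not specify them.

```python
from typing import List, Tuple

Vector = List[float]


def toy_energy(context: Vector, candidate: Vector) -> float:
    """Hypothetical energy: low when the candidate equals context + 1.

    A real EBM would be a learned network; this closed form is only a
    stand-in so the example is self-contained.
    """
    return sum((y - (c + 1.0)) ** 2 for c, y in zip(context, candidate))


def toy_energy_grad(context: Vector, candidate: Vector) -> Vector:
    """Analytic gradient of toy_energy with respect to the candidate."""
    return [2.0 * (y - (c + 1.0)) for c, y in zip(context, candidate)]


def predict(
    context: Vector,
    initial: Vector,
    step_size: float = 0.25,
    threshold: float = 1e-4,
    max_steps: int = 100,
) -> Tuple[Vector, float, int]:
    """Refine a candidate until its energy is judged low enough.

    Returns the final candidate, its energy, and the number of
    refinement steps used (the "dynamic amount of time").
    """
    candidate = list(initial)
    for step in range(max_steps):
        energy = toy_energy(context, candidate)
        # Plausibility check: stop as soon as the prediction is good enough.
        if energy < threshold:
            return candidate, energy, step
        grad = toy_energy_grad(context, candidate)
        # The current prediction feeds back into the next refinement step.
        candidate = [y - step_size * g for y, g in zip(candidate, grad)]
    return candidate, toy_energy(context, candidate), max_steps


if __name__ == "__main__":
    context = [0.0, 1.0]
    for start in ([1.0, 2.0], [5.0, -2.0]):
        state, energy, steps = predict(context, start)
        print(f"start={start} steps={steps} energy={energy:.6f}")
```

In this sketch, a starting guess that is already compatible with the context stops immediately, while a poor guess takes more refinement steps. This loosely mirrors the three facets listed in the abstract: the prediction influences further processing, its plausibility is evaluated, and the compute spent varies with difficulty.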
