Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models

Abstract

To what extent do vision-and-language foundation models possess a realistic world model (observation × action → observation) and a dynamics model (observation × observation → action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. First, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Second, the dynamics model can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves performance competitive with state-of-the-art image editing models, improving on them by a margin of 15% on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
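The two strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`weighted_token_nll`, `best_of_n`), the normalization by the sum of the weights, and the use of the dynamics model's log-probability of the instructed action as the reward are all assumptions made for illustration. The world model and dynamics model are passed in as plain callables.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

Obs = TypeVar("Obs")


def weighted_token_nll(
    token_log_probs: Sequence[float], importance: Sequence[float]
) -> float:
    """Importance-weighted negative log-likelihood over image tokens.

    token_log_probs: log-probability the world model gives each target token.
    importance: non-negative per-token weights (assumed here to come from a
    recognition model). Dividing by the weight sum is an assumption.
    """
    if len(token_log_probs) != len(importance):
        raise ValueError("token_log_probs and importance must align")
    total = sum(importance)
    if total <= 0:
        raise ValueError("importance weights must sum to a positive value")
    weighted = sum(w * lp for w, lp in zip(importance, token_log_probs))
    return -weighted / total


def best_of_n(
    source: Obs,
    action: str,
    sample_next: Callable[[Obs, str], Obs],
    action_log_prob: Callable[[Obs, Obs, str], float],
    n: int = 4,
) -> Tuple[Obs, float]:
    """Inference-time verification as best-of-n selection.

    sample_next: hypothetical world model, (observation, action) -> observation.
    action_log_prob: hypothetical dynamics model score for the instructed
    action given (source, candidate); higher means more consistent.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_next(source, action) for _ in range(n)]
    scores = [action_log_prob(source, c, action) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

In this sketch the dynamics model acts only as a verifier: the world model is sampled independently `n` times and the candidate most consistent with the instructed action is kept.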
