Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models
June 6, 2025
Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
cs.AI
Abstract
To what extent do vision-and-language foundation models possess a realistic
world model (observation × action → observation) and a
dynamics model (observation × observation → action), when
actions are expressed through language? While open-source foundation models
struggle with both, we find that fine-tuning them to acquire a dynamics model
through supervision is significantly easier than acquiring a world model. In
turn, dynamics models can be used to bootstrap world models through two main
strategies: 1) weakly supervised learning from synthetic data and 2) inference-time
verification. Firstly, the dynamics model can annotate actions for
unlabelled pairs of video frame observations to expand the training data. We
further propose a new objective, where image tokens in observation pairs are
weighted by their importance, as predicted by a recognition model. Secondly,
the dynamics model can assign rewards to multiple samples of the world model
to score them, effectively guiding search at inference time. We evaluate the
world models resulting from both strategies through the task of action-centric
image editing on Aurora-Bench. Our best model achieves performance
competitive with state-of-the-art image editing models, improving on them by a
margin of 15% on real-world subsets according to GPT4o-as-judge, and
achieving the best average human evaluation across all subsets of Aurora-Bench.
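
The two bootstrapping strategies lend themselves to short illustrations. Below is a minimal sketch of the first strategy, weakly supervised learning from synthetic data: a fine-tuned dynamics model annotates unlabelled video frame pairs with language actions, and the world-model training loss re-weights image tokens by importance scores from a recognition model. All interfaces here (`predict_action`, the tensor shapes, the weight normalisation) are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
import torch.nn.functional as F

def annotate_frame_pairs(dynamics_model, frame_pairs):
    """Expand the world-model training set: the fine-tuned dynamics model
    labels unlabelled (frame_t, frame_t+1) pairs with a language action.
    `predict_action` is a hypothetical interface, not the paper's API."""
    return [(o, dynamics_model.predict_action(o, o_next), o_next)
            for o, o_next in frame_pairs]

def weighted_image_token_loss(logits, targets, importance):
    """Importance-weighted cross-entropy over predicted image tokens.
    logits: (T, V) scores for T image tokens over a codebook of size V;
    targets: (T,) ground-truth token ids; importance: (T,) per-token weights
    from a recognition model (assumed positive)."""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (T,)
    return (importance * per_token).sum() / importance.sum()
```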
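The second strategy, inference-time verification, can be read as best-of-N sampling with the dynamics model acting as a verifier that rewards candidate edits. A sketch under the same caveat: `sample` and `score_action` are assumed interfaces for illustration only.

```python
def best_of_n_edit(world_model, dynamics_model, observation, action, n=8):
    """Inference-time verification as best-of-N search: sample n candidate
    next observations from the world model, reward each with the dynamics
    model, and return the highest-scoring candidate. `sample` and
    `score_action` are hypothetical interfaces."""
    candidates = [world_model.sample(observation, action) for _ in range(n)]
    rewards = [dynamics_model.score_action(observation, cand, action)
               for cand in candidates]
    best = max(range(n), key=lambda i: rewards[i])
    return candidates[best]
```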