Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models

Abstract

While the situation has improved for text-only models, multimodal (text and image) models currently seem once again to be developing faster than the methods for evaluating them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through goal-oriented game (self-)play, which complements reference-based and preference-based evaluation. Specifically, we define games that challenge a model's capability to represent a situation from visual information and to align such representations through dialogue. We find that the largest closed models perform rather well on the games we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of this performance. There is still room to grow for both kinds of models, which ensures the continued relevance of the benchmark.
