Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models
June 20, 2024
Authors: Sherzod Hakimov, Yerkezhan Abdullayeva, Kushal Koshti, Antonia Schmidt, Yan Weiser, Anne Beyer, David Schlangen
cs.AI
Abstract
While the situation has improved for text-only models, it again seems to be
the case currently that multimodal (text and image) models develop faster than
ways to evaluate them. In this paper, we bring a recently developed evaluation
paradigm from text models to multimodal models, namely evaluation through
goal-oriented game (self-)play, complementing reference-based and
preference-based evaluation. Specifically, we define games that challenge a
model's capability to represent a situation from visual information and align
such representations through dialogue. We find that the largest closed models
perform rather well on the games that we define, while even the best
open-weight models struggle with them. On further analysis, we find that the
exceptional deep captioning capabilities of the largest models drive some of
the performance. There is still room to grow for both kinds of models, ensuring
the continued relevance of the benchmark.