MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
October 18, 2025
Authors: Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, Bowon Ko, Ho-Jin Choi
cs.AI
Abstract
Vision-and-Language Models (VLMs) have shown impressive capabilities on
single-turn benchmarks, yet real-world applications often demand more intricate
multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only
partially capture the breadth and depth of conversational scenarios encountered
by users. In this work, we introduce MultiVerse, a novel multi-turn
conversation benchmark featuring 647 dialogues - each averaging four turns -
derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484
tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from
factual knowledge and perception to advanced reasoning tasks such as
mathematics and coding. To facilitate robust assessment, we propose a
checklist-based evaluation method that leverages GPT-4o as the automated
evaluator, measuring performance across 37 key aspects, including perceptual
accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on
MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve
only a 50% success rate in complex multi-turn conversations, highlighting the
dataset's challenging nature. Notably, we find that providing full dialogue
context significantly enhances performance for smaller or weaker models,
emphasizing the importance of in-context learning. We believe MultiVerse is a
landmark for evaluating the multi-turn interaction abilities of VLMs.
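The checklist-based evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the aspect names, checklist items, and the judge function are assumptions, and the keyword-matching judge merely stands in for a GPT-4o call that would verify each checklist item against the model's response.

```python
from typing import Callable, Dict, List, Tuple

def evaluate_response(
    response: str,
    checklist: List[str],
    judge: Callable[[str, str], bool],
) -> Tuple[float, Dict[str, bool]]:
    """Score one model response against a checklist of key aspects.

    Each checklist item is judged pass/fail; the overall score is the
    fraction of items satisfied (a success rate in [0, 1]).
    """
    results = {item: judge(response, item) for item in checklist}
    score = sum(results.values()) / len(results)
    return score, results

def keyword_judge(response: str, item: str) -> bool:
    # Stub standing in for an LLM judge (e.g., GPT-4o): marks an item
    # satisfied if its keyword appears in the response. Purely
    # illustrative; a real judge would assess meaning, not keywords.
    keyword = item.split(":")[-1].strip().lower()
    return keyword in response.lower()

# Hypothetical checklist items in "aspect: keyword" form.
checklist = [
    "perceptual accuracy: chart",
    "factual correctness: 2023",
    "linguistic clarity: summary",
]
score, results = evaluate_response(
    "The chart shows 2023 revenue; here is a short summary.",
    checklist,
    keyword_judge,
)
```

In the paper's setup, the judge would be a GPT-4o prompt per checklist item, and the 37 key aspects would populate the checklists rather than the toy items shown here.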