MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

October 18, 2025
Authors: Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, Bowon Ko, Ho-Jin Choi
cs.AI

Abstract

Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues, each averaging four turns, derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that uses GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse and find that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting how challenging the dataset is. Notably, providing the full dialogue context significantly improves performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse offers a broad landscape for evaluating the multi-turn interaction abilities of VLMs.
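
The abstract does not say how checklist verdicts are combined into a score, so the following is only a minimal Python sketch of what checklist-based scoring of a multi-turn dialogue could look like. Everything in it is an assumption made for illustration: the `Turn` structure, the yes/no checklist format, the mean-of-turns aggregation, and the toy `keyword_judge`, which stands in for an LLM evaluator such as GPT-4o and is not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    question: str
    response: str
    checklist: List[str]  # hypothetical yes/no criteria for this turn


# Assumed judge signature: (question, response, criterion) -> criterion satisfied?
Judge = Callable[[str, str, str], bool]


def turn_score(turn: Turn, judge: Judge) -> float:
    """Fraction of checklist items the judge marks as satisfied."""
    if not turn.checklist:
        return 0.0
    passed = sum(judge(turn.question, turn.response, c) for c in turn.checklist)
    return passed / len(turn.checklist)


def dialogue_score(turns: List[Turn], judge: Judge) -> float:
    """Mean of the per-turn scores (an assumed aggregation, not from the paper)."""
    if not turns:
        return 0.0
    return sum(turn_score(t, judge) for t in turns) / len(turns)


def keyword_judge(question: str, response: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: passes if the criterion word appears."""
    return criterion.lower() in response.lower()


demo = [
    Turn("What animal is in the image?", "A red fox on snow.", ["fox", "snow"]),
    Turn("What colour is it?", "It looks red.", ["red", "orange"]),
]
score = dialogue_score(demo, keyword_judge)
print(score)
```

In a real setup the judge callable would wrap a call to the evaluator model, and the per-criterion verdicts could be grouped by aspect (for example perceptual accuracy or factual correctness) before averaging.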