WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
June 16, 2024
Authors: Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, Bill Yuchen Lin
cs.AI
Abstract
Recent breakthroughs in vision-language models (VLMs) emphasize the necessity
of benchmarking human preferences in real-world multimodal interactions. To
address this gap, we launched WildVision-Arena (WV-Arena), an online platform
that collects human preferences to evaluate VLMs. We curated WV-Bench by
selecting 500 high-quality samples from 8,000 user submissions in WV-Arena.
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet,
achieving a Spearman correlation of 0.94 with the WV-Arena Elo. This
significantly outperforms other benchmarks like MMVet, MMMU, and MMStar.
Our comprehensive analysis of 20K real-world interactions reveals important
insights into the failure cases of top-performing VLMs. For example, we find
that although GPT-4V surpasses many other models like Reka-Flash, Opus, and
Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces
challenges with subtle contextual cues, spatial reasoning, visual imagination,
and expert domain knowledge. Additionally, current VLMs exhibit issues with
hallucinations and safety when intentionally provoked. We are releasing our
chat and feedback data to further advance research in the field of VLMs.