WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
June 16, 2024
Authors: Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, Bill Yuchen Lin
cs.AI
Abstract
Recent breakthroughs in vision-language models (VLMs) emphasize the necessity
of benchmarking human preferences in real-world multimodal interactions. To
address this gap, we launched WildVision-Arena (WV-Arena), an online platform
that collects human preferences to evaluate VLMs. We curated WV-Bench by
selecting 500 high-quality samples from 8,000 user submissions in WV-Arena.
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet,
achieving a Spearman correlation of 0.94 with the WV-Arena Elo ranking. This
correlation is significantly higher than that achieved by other benchmarks
such as MMVet, MMMU, and MMStar.
Our comprehensive analysis of 20K real-world interactions reveals important
insights into the failure cases of top-performing VLMs. For example, we find
that although GPT-4V surpasses many other models like Reka-Flash, Opus, and
Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces
challenges with subtle contextual cues, spatial reasoning, visual imagination,
and expert domain knowledge. Additionally, current VLMs exhibit issues with
hallucinations and safety when intentionally provoked. We are releasing our
chat and feedback data to further advance research in the field of VLMs.
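To make the evaluation protocol described above concrete, the sketch below shows how arena-style Elo ratings can be derived from a stream of pairwise human preference votes, and how a static benchmark's per-model scores can then be checked against that ranking with a Spearman correlation (the paper reports 0.94 for WV-Bench). This is a minimal illustration under standard Elo assumptions (K-factor 32, initial rating 1000); the vote data, benchmark scores, and function names are hypothetical and not taken from the WildVision codebase, and the GPT-4-as-judge step that produces WV-Bench scores is abstracted away.

```python
from scipy.stats import spearmanr

def update_elo(ratings, model_a, model_b, score_a, k=32):
    """Apply one Elo update; score_a is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

# Replay a (hypothetical) stream of human preference votes from the arena.
votes = [
    ("gpt-4v", "reka-flash", 1.0),
    ("yi-vl-plus", "gpt-4v", 0.0),
    ("reka-flash", "yi-vl-plus", 0.5),
]
ratings = {m: 1000.0 for m in ("gpt-4v", "reka-flash", "yi-vl-plus")}
for a, b, score_a in votes:
    update_elo(ratings, a, b, score_a)

# Rank models by Elo, then measure how well a benchmark's per-model
# scores track the human-preference ranking.
models = sorted(ratings)
elo_scores = [ratings[m] for m in models]
bench_scores = [0.71, 0.58, 0.62]  # hypothetical benchmark scores, same model order
rho, _ = spearmanr(bench_scores, elo_scores)
print(f"Spearman correlation with arena Elo: {rho:.2f}")
```

Spearman correlation compares rank orders rather than raw values, which is why it is a natural fit here: a benchmark is judged by whether it orders models the same way human voters do, not by the scale of its scores.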