LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

July 17, 2024
Authors: Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu
cs.AI

Abstract

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase and maintain the leaderboard of LIVEBENCH at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.
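
As a rough illustration of how a unified harness like LMMS-EVAL is typically driven, the Python sketch below shells out to the package's command-line entry point. The flag names, the model identifier (llava), the checkpoint argument, and the task name (mme) are assumptions modeled on an lm-evaluation-harness-style interface, not confirmed by the abstract; the linked GitHub repository is the authoritative reference for the supported options.

```python
# Minimal sketch (assumed interface, not the project's documented API) of
# launching an LMMS-EVAL run from Python by shelling out to its CLI entry point.
import subprocess

# All values below are illustrative assumptions; replace the model identifier,
# checkpoint argument, and task name with ones listed in the lmms-eval README.
cmd = [
    "python", "-m", "lmms_eval",
    "--model", "llava",                                     # assumed model name
    "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",  # assumed checkpoint spec
    "--tasks", "mme",                                       # assumed task identifier
    "--batch_size", "1",
    "--output_path", "./logs/",                             # directory for scores and logs
]

subprocess.run(cmd, check=True)  # raises CalledProcessError if the evaluation fails
```

Keeping the invocation a flat, declarative list of flags makes each run easy to log and rerun, which matches the transparency and reproducibility goals the abstract emphasizes.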
