LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

July 17, 2024
Authors: Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu
cs.AI

Abstract

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multimodal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH, which utilizes continuously updated news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions for navigating the trade-offs in evaluating large multimodal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase and maintain the LIVEBENCH leaderboard at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench, respectively.
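
For readers who want to try the open-sourced toolkit, the snippet below is a minimal sketch of launching an evaluation run from Python. It assumes the pip-installable `lmms-eval` package and the lm-evaluation-harness-style command-line flags shown in the repository README (`--model`, `--model_args`, `--tasks`, `--batch_size`, `--output_path`); exact flag names, model wrappers, and task identifiers may vary between releases, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch: invoking the lmms-eval CLI from Python.
# Assumptions: `lmms-eval` is installed, and the flags below follow the
# lm-evaluation-harness-style interface described in the project README;
# available models and tasks depend on the installed version.
import subprocess
import sys

cmd = [
    sys.executable, "-m", "lmms_eval",
    "--model", "llava",                                    # one of the supported model wrappers
    "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",  # model-specific arguments
    "--tasks", "mme,mmbench_en",                            # comma-separated task names
    "--batch_size", "1",
    "--log_samples",                                        # keep per-sample outputs for inspection
    "--output_path", "./logs/",
]
subprocess.run(cmd, check=True)
```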
