LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Abstract

Advances in large foundation models call for benchmarks that offer wide coverage, low cost, and zero contamination. Despite continuous exploration of language model evaluation, comprehensive studies on evaluating Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models, designed to promote transparent and reproducible evaluation. Although LMMS-EVAL offers comprehensive coverage, we find that it still falls short of achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH, which uses continuously updated news and online forums to assess models' generalization abilities in the wild, offering a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions for navigating the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase at https://github.com/EvolvingLMMs-Lab/lmms-eval and maintain the LIVEBENCH leaderboard at https://huggingface.co/spaces/lmms-lab/LiveBench.
