
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

March 10, 2025
作者: Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li
cs.AI

Abstract
Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluation of such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 of the latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning, thus providing valuable directions for future multimodal AI research efforts.
