

MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

July 18, 2024
作者: Guoli Yin, Haoping Bai, Shuang Ma, Feng Nan, Yanchao Sun, Zhaoyang Xu, Shen Ma, Jiarui Lu, Xiang Kong, Aonan Zhang, Dian Ang Yap, Yizhe Zhang, Karsten Ahnert, Vik Kamath, Mathias Berglund, Dominic Walsh, Tobias Gindele, Juergen Wiest, Zhengfeng Lai, Xiaoming Wang, Jiulong Shan, Meng Cao, Ruoming Pang, Zirui Wang
cs.AI

Abstract

Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/docs/research/mmau.
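The abstract's five-domain by five-capability breakdown can be pictured as a small scoring grid. The sketch below is a hypothetical illustration of how per-prompt results might be aggregated into that grid; the names `DOMAINS`, `CAPABILITIES`, and `aggregate` are assumptions for illustration only and are not part of the released MMAU evaluation scripts.

```python
from collections import defaultdict

# Hypothetical labels mirroring the five domains and five capabilities
# described in the abstract; not taken from the released MMAU code.
DOMAINS = ["Tool-use", "DAG QA", "DS & ML coding",
           "Contest-level programming", "Mathematics"]
CAPABILITIES = ["Understanding", "Reasoning", "Planning",
                "Problem-solving", "Self-correction"]

def aggregate(results):
    """Aggregate (domain, capability, correct) tuples into per-cell accuracy."""
    totals = defaultdict(lambda: [0, 0])  # (domain, capability) -> [correct, attempted]
    for domain, capability, correct in results:
        cell = totals[(domain, capability)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: c / n for key, (c, n) in totals.items() if n}

# Toy usage: two results for one model
print(aggregate([("Mathematics", "Reasoning", True),
                 ("Tool-use", "Planning", False)]))
```

This kind of per-cell view is what lets a benchmark attribute a failure to a specific capability within a domain, rather than only reporting end-to-end task completion.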
