
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

July 18, 2024
Authors: Guoli Yin, Haoping Bai, Shuang Ma, Feng Nan, Yanchao Sun, Zhaoyang Xu, Shen Ma, Jiarui Lu, Xiang Kong, Aonan Zhang, Dian Ang Yap, Yizhe Zhang, Karsten Ahnert, Vik Kamath, Mathias Berglund, Dominic Walsh, Tobias Gindele, Juergen Wiest, Zhengfeng Lai, Xiaoming Wang, Jiulong Shan, Meng Cao, Ruoming Pang, Zirui Wang
cs.AI

Abstract

Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/docs/research/mmau.
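
The abstract states that MMAU maps its 20 tasks onto five capabilities and reports performance per capability. As a rough illustration only, the Python sketch below shows one way per-prompt results could be aggregated into capability-level accuracies; the data structures and task names are hypothetical and are not taken from the released MMAU evaluation scripts.

```python
# Hypothetical sketch (not the official axlearn/MMAU evaluation code):
# aggregate per-prompt correctness into per-capability accuracy scores.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PromptResult:
    task: str          # one of MMAU's 20 tasks; name here is illustrative
    capability: str    # e.g. "understanding", "reasoning", "planning",
                       # "problem_solving", "self_correction"
    correct: bool      # whether the model's answer matched the reference


def capability_scores(results: list[PromptResult]) -> dict[str, float]:
    """Mean accuracy per capability across all prompts mapped to it."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.capability] += 1
        hits[r.capability] += int(r.correct)
    return {cap: hits[cap] / totals[cap] for cap in totals}


if __name__ == "__main__":
    demo = [
        PromptResult("tool_use", "planning", True),
        PromptResult("dag_qa", "reasoning", False),
        PromptResult("contest_programming", "problem_solving", True),
    ]
    print(capability_scores(demo))
```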
