MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Abstract

Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to pinpoint where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics. It covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/docs/research/mmau.
