Law of the Weakest Link: Cross Capabilities of Large Language Models
September 30, 2024
作者: Ming Zhong, Aston Zhang, Xuewei Wang, Rui Hou, Wenhan Xiong, Chenguang Zhu, Zhengxing Chen, Liang Tan, Chloe Bi, Mike Lewis, Sravya Popuri, Sharan Narang, Melanie Kambadur, Dhruv Mahajan, Sergey Edunov, Jiawei Han, Laurens van der Maaten
cs.AI
Abstract
The development and evaluation of Large Language Models (LLMs) have largely
focused on individual capabilities. However, this overlooks the intersection of
multiple abilities across different types of expertise that are often required
for real-world tasks, which we term cross capabilities. To systematically
explore this concept, we first define seven core individual capabilities and
then pair them to form seven common cross capabilities, each supported by a
manually constructed taxonomy. Building on these definitions, we introduce
CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100
prompts for each individual and cross capability. To ensure reliable
evaluation, we recruited expert annotators to assess 4,200 model responses,
gathering 8,400 human ratings with detailed explanations to serve as reference
examples. Our findings reveal that, in both static evaluations and attempts to
enhance specific abilities, current LLMs consistently exhibit the "Law of the
Weakest Link," where cross-capability performance is significantly constrained
by the weakest component. Specifically, across 58 cross-capability scores from
17 models, 38 fall below both constituent individual-capability scores, while
20 fall between the two but closer to the weaker one. These results
highlight the under-performance of LLMs in cross-capability tasks, making the
identification and improvement of the weakest capabilities a critical priority
for future research to optimize performance in complex, multi-dimensional
scenarios.
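The "Law of the Weakest Link" classification described above can be made concrete with a small sketch. This is an illustrative helper, not code from the paper; the function name and labels are hypothetical, and it simply compares a cross-capability score against its two component scores:

```python
def classify_cross_score(cross, ability_a, ability_b):
    """Classify a cross-capability score relative to its two components.

    Returns one of three illustrative labels:
    - "below_weakest":      cross score is lower than both components
    - "closer_to_weaker":   between the two, nearer the weaker one
    - "closer_to_stronger": between the two, nearer the stronger one
    """
    weak, strong = sorted((ability_a, ability_b))
    if cross < weak:
        return "below_weakest"
    # Otherwise compare distances to the weaker and stronger components.
    if cross - weak <= strong - cross:
        return "closer_to_weaker"
    return "closer_to_stronger"

# Hypothetical scores on some rating scale: a cross-capability score of 3.1
# against individual-capability scores of 3.4 and 4.2.
print(classify_cross_score(3.1, 3.4, 4.2))  # below_weakest
```

Under this scheme, the paper's finding is that 38 of 58 cross-capability scores land in the first category and 20 in the second, with none closer to the stronger component.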