Law of the Weakest Link: Cross Capabilities of Large Language Models

Abstract

The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that real-world tasks often require, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than all individual capabilities, while 20 fall between the stronger and weaker abilities, but closer to the weaker one. These results highlight the underperformance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
