EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

Abstract

An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable guidance for improvement. Toward these goals for language model (LM) evaluation, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also propose EvalTree, a weakness profiling method. EvalTree constructs a capability tree in which each node represents a capability described in natural language and is linked to the subset of benchmark instances that specifically evaluate that capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we release our code and an interface that allows practitioners to interactively explore the capability trees built by EvalTree.
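The two steps described above (a capability tree whose nodes link to benchmark instances, followed by extraction of poorly performing nodes) can be illustrated with a minimal Python sketch. This is not the released EvalTree code: the `CapabilityNode` class, the `extract_weaknesses` function, the fixed accuracy `threshold`, and the `min_instances` cutoff are all hypothetical choices made for illustration, and the abstract does not specify how the tree is built or what criterion defines "performs poorly".

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityNode:
    """A capability described in natural language, linked to benchmark instances."""
    capability: str
    instance_ids: list[int] = field(default_factory=list)
    children: list["CapabilityNode"] = field(default_factory=list)


def accuracy(node, correct):
    """Fraction of the node's linked instances that the LM answered correctly."""
    return sum(correct[i] for i in node.instance_ids) / len(node.instance_ids)


def extract_weaknesses(node, correct, threshold=0.5, min_instances=3):
    """Collect capability descriptions of nodes where accuracy is below threshold.

    Assumption: a plain accuracy threshold stands in for whatever extraction
    criterion the paper actually uses; min_instances skips nodes with too few
    linked instances to judge.
    """
    if len(node.instance_ids) < min_instances:
        return []
    if accuracy(node, correct) < threshold:
        # Report the broadest weak capability and do not descend further.
        return [node.capability]
    weaknesses = []
    for child in node.children:
        weaknesses.extend(extract_weaknesses(child, correct, threshold, min_instances))
    return weaknesses


# Toy data: per-instance correctness of an LM on an 8-instance benchmark.
correct = {0: True, 1: True, 2: True, 3: False, 4: False, 5: True, 6: False, 7: True}

algebra = CapabilityNode("Solving algebraic equations", [0, 1, 2, 7])
geometry = CapabilityNode("Reasoning about geometric figures", [3, 4, 5, 6])
root = CapabilityNode("Mathematical reasoning", list(range(8)), [algebra, geometry])

profile = extract_weaknesses(root, correct)
print(profile)
```

In this toy example the root node is above the threshold overall, so the search descends and reports only the geometry node as a weakness. Stopping at the first weak node on each path keeps the profile at the coarsest granularity that still isolates the failure, which is one plausible design choice among several.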
