EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
March 11, 2025
Authors: Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh
cs.AI
Abstract
An ideal model evaluation should achieve two goals: identifying where the
model fails and providing actionable improvement guidance. Toward these goals
for Language Model (LM) evaluations, we formulate the problem of generating a
weakness profile, a set of weaknesses expressed in natural language, given an
LM's performance on every individual instance in a benchmark. We introduce a
suite of quantitative assessments to compare different weakness profiling
methods. We also propose EvalTree, a weakness profiling method. It constructs a
capability tree where each node represents a capability described in natural
language and is linked to a subset of benchmark instances that specifically
evaluate this capability; it then extracts nodes where the LM performs poorly
to generate a weakness profile. On the MATH and WildChat benchmarks, we show
that EvalTree outperforms baseline weakness profiling methods by identifying
weaknesses more precisely and comprehensively. Weakness profiling further
enables weakness-guided data collection, and training data collection guided by
EvalTree-identified weaknesses improves LM performance more than other data
collection strategies. We also show how EvalTree exposes flaws in Chatbot
Arena's human-voter-based evaluation practice. To facilitate future work, we
release our code and an interface that allows practitioners to interactively
explore the capability trees built by EvalTree.
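
The capability-tree idea described above can be illustrated with a minimal sketch. It assumes the tree has already been built and that per-instance results are available as a mapping from instance ID to correct/incorrect. The names `CapabilityNode`, `accuracy`, and `extract_weak_nodes` are hypothetical, and the plain accuracy threshold with a minimum instance count is a simplifying assumption made for illustration; the abstract does not specify the extraction criterion EvalTree actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CapabilityNode:
    """One node of a capability tree (hypothetical structure for illustration)."""
    description: str            # capability described in natural language
    instance_ids: List[int]     # benchmark instances that evaluate this capability
    children: List["CapabilityNode"] = field(default_factory=list)


def accuracy(node: CapabilityNode, correct: Dict[int, bool]) -> float:
    """Fraction of the node's linked instances that the LM answered correctly."""
    return sum(correct[i] for i in node.instance_ids) / len(node.instance_ids)


def extract_weak_nodes(
    node: CapabilityNode,
    correct: Dict[int, bool],
    threshold: float = 0.5,     # assumed cutoff, not taken from the paper
    min_instances: int = 2,     # assumed guard against tiny, noisy nodes
) -> List[CapabilityNode]:
    """Return the highest-level nodes whose accuracy falls below the threshold."""
    if len(node.instance_ids) < min_instances:
        return []
    if accuracy(node, correct) < threshold:
        return [node]           # stop here: descendants are covered by this node
    weak: List[CapabilityNode] = []
    for child in node.children:
        weak.extend(extract_weak_nodes(child, correct, threshold, min_instances))
    return weak


# Toy data: six benchmark instances, two of which the LM got wrong.
correct = {0: True, 1: True, 2: False, 3: False, 4: True, 5: True}
root = CapabilityNode(
    "Mathematical reasoning",
    [0, 1, 2, 3, 4, 5],
    children=[
        CapabilityNode("Solving linear equations", [0, 1]),
        CapabilityNode("Computing areas of triangles", [2, 3]),
        CapabilityNode("Counting permutations", [4, 5]),
    ],
)

weakness_profile = [n.description for n in extract_weak_nodes(root, correct)]
print(weakness_profile)
```

In this toy tree the root passes the threshold, so the search descends and reports only the child capability on which every linked instance was answered incorrectly. Returning the highest-level weak node keeps the profile short; a real method would also need to account for statistical noise in small nodes.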