EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

March 11, 2025
Authors: Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh
cs.AI

Abstract

An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for Language Model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also propose a weakness profiling method EvalTree. It constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we release our code and an interface that allows practitioners to interactively explore the capability trees built by EvalTree.
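
The tree-and-extraction idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `CapabilityNode` class, the `extract_weaknesses` function, the fixed accuracy threshold, the minimum node size, and the toy data are all hypothetical, chosen only to show how nodes linked to instance subsets could be scanned for poor performance. The abstract does not specify how the tree is built or how low-performing nodes are selected, so a plain accuracy cutoff is assumed here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CapabilityNode:
    """Hypothetical node: a natural-language capability plus the instances that test it."""
    description: str
    instance_ids: List[int]
    children: List["CapabilityNode"] = field(default_factory=list)


def accuracy(node: CapabilityNode, correct: Dict[int, bool]) -> float:
    """Fraction of the node's linked instances that the LM answered correctly."""
    return sum(correct[i] for i in node.instance_ids) / len(node.instance_ids)


def extract_weaknesses(
    node: CapabilityNode,
    correct: Dict[int, bool],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
    min_size: int = 2,       # assumed minimum number of instances per node
) -> List[str]:
    """Collect descriptions of nodes whose accuracy falls below the threshold.

    A flagged node is reported as a whole, so its descendants are not visited.
    """
    if len(node.instance_ids) < min_size:
        return []
    if accuracy(node, correct) < threshold:
        return [node.description]
    found: List[str] = []
    for child in node.children:
        found.extend(extract_weaknesses(child, correct, threshold, min_size))
    return found


# Toy per-instance results: instance id -> whether the LM was correct.
correct = {0: True, 1: True, 2: True, 3: False,
           4: False, 5: False, 6: True, 7: True}

# Toy capability tree; each parent's instances are the union of its children's.
root = CapabilityNode(
    "Mathematical reasoning",
    list(range(8)),
    [
        CapabilityNode("Algebra", [0, 1, 2, 3]),
        CapabilityNode(
            "Geometry",
            [4, 5, 6, 7],
            [
                CapabilityNode("Circle geometry", [4, 5]),
                CapabilityNode("Triangle geometry", [6, 7]),
            ],
        ),
    ],
)

weaknesses = extract_weaknesses(root, correct)
print(weaknesses)
```

In this toy example the root and the "Geometry" node both sit at or above the assumed threshold, so the search descends until it reaches a node that falls below it. Reporting the highest failing node, rather than every failing descendant, keeps the resulting profile short.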
