EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

Abstract

An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable guidance for improvement. Toward these goals for language model (LM) evaluation, we formulate the problem of generating a weakness profile: a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments for comparing weakness profiling methods. We also propose EvalTree, a weakness profiling method. EvalTree constructs a capability tree in which each node represents a capability described in natural language and is linked to the subset of benchmark instances that specifically evaluate that capability; it then extracts the nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and more comprehensively. Weakness profiling further enables weakness-guided data collection: collecting training data guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies do. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we release our code and an interface that lets practitioners interactively explore the capability trees built by EvalTree.
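To make the capability-tree idea concrete, the following is a minimal Python sketch of the two ingredients the abstract names: nodes that pair a natural-language capability with a subset of benchmark instances, and an extraction step that returns nodes where the LM performs poorly. It is an illustration under stated assumptions, not the released EvalTree implementation. The names `CapabilityNode` and `extract_weaknesses`, the fixed accuracy `threshold`, the `min_instances` cutoff, and the rule of stopping at the shallowest weak node are all hypothetical choices; the abstract does not specify how the tree is built or how "performs poorly" is tested.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CapabilityNode:
    """One node of a capability tree (hypothetical structure, for illustration)."""
    capability: str                 # natural-language description of the capability
    instance_ids: list[int]         # benchmark instances linked to this node
    children: list[CapabilityNode] = field(default_factory=list)


def accuracy(node: CapabilityNode, correct: dict[int, bool]) -> float:
    """Fraction of the node's linked instances that the LM got right."""
    return sum(correct[i] for i in node.instance_ids) / len(node.instance_ids)


def extract_weaknesses(root: CapabilityNode, correct: dict[int, bool],
                       threshold: float = 0.5, min_instances: int = 2) -> list[str]:
    """Collect the shallowest nodes whose accuracy falls below `threshold`.

    Assumption: a plain accuracy cutoff stands in for whatever criterion the
    actual method uses to decide that the LM "performs poorly" on a node.
    """
    weak: list[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        enough = len(node.instance_ids) >= max(min_instances, 1)
        if enough and accuracy(node, correct) < threshold:
            weak.append(node.capability)    # report this node, do not descend
            continue
        stack.extend(reversed(node.children))  # visit children left to right
    return weak


# Toy per-instance results: instance id -> whether the LM answered correctly.
correct = {0: True, 1: True, 2: False, 3: False, 4: True}

algebra = CapabilityNode("Solving algebraic equations", [0, 1])
geometry = CapabilityNode("Reasoning about geometric figures", [2, 3, 4])
root = CapabilityNode("Mathematical reasoning", [0, 1, 2, 3, 4], [algebra, geometry])

weaknesses = extract_weaknesses(root, correct)
print(weaknesses)  # ['Reasoning about geometric figures']
```

In this toy run the root scores 3/5 and is not flagged, so the search descends and reports only the geometry node (1/3). Raising `threshold` to 0.7 would flag the root itself, which shows how the cutoff trades off specific against broad weakness descriptions.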
