Revisiting Generalization Across Difficulty Levels: It's Not So Easy
November 26, 2025
Authors: Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach
cs.AI
Abstract
We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
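As a rough illustration of the kind of difficulty estimation the abstract describes, the sketch below fits a 1PL (Rasch) IRT model to a binary correctness matrix of LLM responses and ranks examples by the estimated difficulty parameter. This is an assumption-laden sketch, not the authors' code: the function name, the synthetic data, and the simple gradient-ascent fit are placeholders for whatever IRT implementation the paper actually uses.

```python
# Minimal sketch: rank example difficulty from LLM outputs with a 1PL (Rasch) IRT model.
# R[m, i] = 1 if model m answered example i correctly, else 0.
# Under the Rasch model, P(correct) = sigmoid(theta_m - b_i), where theta_m is the
# ability of model m and b_i is the difficulty of example i.
import numpy as np

def fit_rasch(R, n_steps=2000, lr=0.05):
    """Estimate model abilities (theta) and example difficulties (b) by gradient ascent
    on the Bernoulli log-likelihood of the response matrix R."""
    n_models, n_items = R.shape
    theta = np.zeros(n_models)   # ability of each LLM
    b = np.zeros(n_items)        # difficulty of each example
    for _ in range(n_steps):
        logits = theta[:, None] - b[None, :]
        p = 1.0 / (1.0 + np.exp(-logits))   # predicted probability of a correct answer
        resid = R - p                        # gradient of the log-likelihood w.r.t. logits
        theta += lr * resid.sum(axis=1) / n_items
        b -= lr * resid.sum(axis=0) / n_models
        b -= b.mean()                        # center difficulties to fix the scale
    return theta, b

# Toy usage with synthetic data: 50 "models" answering 20 "examples".
rng = np.random.default_rng(0)
true_theta = rng.normal(size=50)
true_b = np.linspace(-2, 2, 20)
p_true = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
R = (rng.random((50, 20)) < p_true).astype(float)

theta_hat, b_hat = fit_rasch(R)
print(np.argsort(b_hat))  # example indices ordered from easiest to hardest
```

With difficulty scores like `b_hat` in hand, examples can be grouped into fine-grained difficulty bins for the kind of cross-difficulty training and evaluation splits the abstract describes, without relying on human judgments of difficulty.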