
Revisiting Generalization Across Difficulty Levels: It's Not So Easy

November 26, 2025
作者: Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach
cs.AI

Abstract

We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
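To illustrate the difficulty-ranking step described above, here is a minimal sketch of fitting a one-parameter (Rasch) IRT model to a binary correctness matrix of many LLMs over benchmark examples, then ordering the examples by estimated difficulty. The 1PL simplification, the gradient-descent fit, and all names below are assumptions for illustration only, not the paper's actual pipeline.

```python
# Minimal sketch: rank benchmark examples by IRT difficulty, assuming a
# 1-parameter (Rasch) model fit by gradient ascent on a binary correctness
# matrix. This is NOT the authors' implementation; the paper may use a
# different IRT variant and fitting procedure.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rasch(responses, n_steps=2000, lr=0.05):
    """Fit ability theta (per model) and difficulty b (per item).

    responses: (n_models, n_items) array of 0/1 scores, e.g. whether each
    LLM answered each benchmark example correctly.
    """
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)   # model abilities
    b = np.zeros(n_items)        # item (example) difficulties
    for _ in range(n_steps):
        p = sigmoid(theta[:, None] - b[None, :])  # predicted P(correct)
        resid = responses - p                      # log-likelihood gradient term
        theta += lr * resid.sum(axis=1) / n_items
        b     -= lr * resid.sum(axis=0) / n_models
        b     -= b.mean()                          # fix the scale (identifiability)
    return theta, b

# Toy usage: 5 hypothetical LLMs on 8 examples; higher b means a harder example.
rng = np.random.default_rng(0)
toy = (rng.random((5, 8)) > 0.4).astype(float)
theta, b = fit_rasch(toy)
ranked = np.argsort(b)  # examples ordered easiest -> hardest
print("difficulty ranking (easy to hard):", ranked)
```

With difficulty estimates like `b` in hand, examples can be binned into fine-grained difficulty groups for the kind of cross-difficulty training and evaluation analysis the abstract describes.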