Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Prediction of Human Item Difficulty

Abstract

Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual representations, providing limited evidence about the cognitive processes that make items difficult. We argue that difficulty should be viewed not only as a property of item text, but also as an observable consequence of the problem-solving burden an item induces. Large Reasoning Models (LRMs) offer scalable process evidence through reasoning traces, but such evidence must be structured to support interpretable modeling. To this end, we introduce Epi2Diff (Episode to Difficulty), a framework that maps LRM reasoning traces into cognitively grounded episode sequences. These episodes group trace segments into functional problem-solving states, enabling difficulty to be modeled through reasoning scale, effort allocation, and state transitions. Epi2Diff extracts compact episode-dynamic features and combines them with semantic item representations for human difficulty prediction. Experiments on four real-world human difficulty datasets show that Epi2Diff consistently outperforms strong baselines, including fine-tuned small language models, LLM in-context learning, and supervised LLM adaptation. On SAT-derived classification benchmarks, Epi2Diff achieves an 8.1% average relative gain over supervised LLM fine-tuning baselines. Further analyses show that harder items induce more effortful, iterative, and implementation-centered episode dynamics, rather than merely longer responses. These results demonstrate that cognitive episodes in LRM reasoning traces provide a predictive and interpretable process representation for human item difficulty, offering a new lens for educational measurement with reasoning models.
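As a rough illustration of what episode-dynamic features could look like, the sketch below computes reasoning scale, effort allocation, and one state-transition count from a labeled episode sequence. It is not the Epi2Diff implementation: the episode labels (`read`, `plan`, `implement`, `verify`), the function name `episode_features`, and the token counts are all assumptions made for this example, since the abstract does not specify the episode taxonomy or the feature set.

```python
from collections import Counter

# Hypothetical episode labels; the abstract does not list the actual taxonomy.
STATES = ("read", "plan", "implement", "verify")


def episode_features(episodes):
    """Summarize a trace given as (state, n_tokens) pairs in trace order."""
    total = sum(n for _, n in episodes)

    # Effort allocation: tokens spent in each state.
    tokens_by_state = Counter()
    for state, n in episodes:
        tokens_by_state[state] += n

    # State transitions: counts of consecutive (from_state, to_state) pairs.
    transitions = Counter(
        (a, b) for (a, _), (b, _) in zip(episodes, episodes[1:])
    )

    # Reasoning scale: overall size of the trace.
    features = {"total_tokens": total, "n_episodes": len(episodes)}
    for s in STATES:
        features[f"share_{s}"] = tokens_by_state[s] / total if total else 0.0

    # One illustrative iteration signal: returns from checking to implementing.
    features["verify_to_implement"] = transitions[("verify", "implement")]
    return features


# Made-up trace for illustration only.
trace = [
    ("read", 40), ("plan", 60), ("implement", 200),
    ("verify", 50), ("implement", 100), ("verify", 50),
]
feats = episode_features(trace)
```

Under these assumptions, a feature vector of this kind would be combined with a semantic representation of the item text before being passed to a difficulty predictor. That step is omitted here because the abstract does not describe how the two are combined.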