Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning
June 12, 2025
Authors: Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang
cs.AI
Abstract
Faithful evaluation of language model capabilities is crucial for deriving
actionable insights that can inform model development. However, rigorous causal
evaluations in this domain face significant methodological challenges,
including complex confounding effects and prohibitive computational costs
associated with extensive retraining. To tackle these challenges, we propose a
causal representation learning framework wherein observed benchmark performance
is modeled as a linear transformation of a few latent capability factors.
Crucially, these latent factors are identified as causally interrelated after
appropriately controlling for the base model as a common confounder. Applying
this approach to a comprehensive dataset encompassing over 1500 models
evaluated across six benchmarks from the Open LLM Leaderboard, we identify a
concise three-node linear causal structure that reliably explains the observed
performance variations. Further interpretation of this causal structure
provides substantial scientific insights beyond simple numerical rankings:
specifically, we reveal a clear causal direction starting from general
problem-solving capabilities, advancing through instruction-following
proficiency, and culminating in mathematical reasoning ability. Our results
underscore the essential role of carefully controlling base model variations
during evaluation, a step critical to accurately uncovering the underlying
causal relationships among latent model capabilities.