

Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

June 12, 2025
作者: Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang
cs.AI

Abstract

Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.
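The modeling idea described above can be illustrated with a minimal sketch. Everything here is synthetic and hypothetical: the data generation, the number of base-model families, and the use of one-hot residualization followed by SVD are stand-ins for the paper's actual identification procedure, shown only to convey the "benchmark scores = linear map of a few latent factors, with the base model as a shared confounder" structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic setup: 1500 models x 6 benchmark scores,
# generated from 3 latent capability factors plus a base-model confounder.
n_models, n_benchmarks, n_factors, n_families = 1500, 6, 3, 20
base_model = rng.integers(0, n_families, size=n_models)    # base-model family id
confounder = base_model[:, None] * 0.05                    # shared base-model effect
latent = rng.normal(size=(n_models, n_factors))            # latent capabilities
loadings = rng.normal(size=(n_factors, n_benchmarks))      # linear transformation
scores = latent @ loadings + confounder \
    + 0.1 * rng.normal(size=(n_models, n_benchmarks))

# Step 1: control for the base model by residualizing the benchmark
# scores against base-model indicators (a crude stand-in for the
# paper's confounder adjustment).
dummies = np.eye(n_families)[base_model]                   # one-hot base model
beta, *_ = np.linalg.lstsq(dummies, scores, rcond=None)
residuals = scores - dummies @ beta

# Step 2: recover a low-dimensional linear factor space from the
# residuals via SVD (PCA); with 3 true factors, the top 3 singular
# directions should capture most of the remaining variance.
centered = residuals - residuals.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()
print(explained[:3].sum())
```

In this toy construction the top three components dominate because the data were built that way; the paper's contribution is showing that such a three-factor linear structure, plus a causal ordering among the factors, is identifiable from real leaderboard data once the base model is controlled for.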