Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

Abstract

Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.
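The modeling assumption in the abstract (six benchmark scores as a linear transformation of three causally chained latent factors) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' method: the function names, the edge coefficients 0.8 and 0.6, the noise scales, and the random mixing matrix are all hypothetical, and the control for the base model as a confounder is omitted for brevity.

```python
import numpy as np


def simulate_benchmarks(n_models=1500, seed=0):
    """Simulate benchmark scores from a hypothetical 3-node linear causal chain.

    All coefficients and noise scales are illustrative assumptions,
    not values reported in the paper.
    """
    rng = np.random.default_rng(seed)
    # Latent chain: z1 (general problem solving) -> z2 (instruction following)
    # -> z3 (mathematical reasoning).
    z1 = rng.normal(size=n_models)
    z2 = 0.8 * z1 + rng.normal(scale=0.5, size=n_models)
    z3 = 0.6 * z2 + rng.normal(scale=0.5, size=n_models)
    latents = np.column_stack([z1, z2, z3])      # shape (n_models, 3)
    # Hypothetical mixing matrix mapping 3 latent factors to 6 benchmarks.
    mixing = rng.normal(size=(3, 6))
    scores = latents @ mixing                    # shape (n_models, 6), noise-free
    return latents, mixing, scores


def effective_rank(scores, rel_tol=1e-8):
    """Count singular values of the centered score matrix above a relative tolerance."""
    centered = scores - scores.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return int((singular_values > rel_tol * singular_values[0]).sum())


latents, mixing, scores = simulate_benchmarks()
print(scores.shape, effective_rank(scores))
```

Because the simulated scores are an exact linear image of three factors, the six-column score matrix has only three non-negligible singular values. Real leaderboard data would be noisy, so the low-rank structure would hold only approximately.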
