Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

Abstract

Language models are increasingly capable, yet they still fail at the seemingly simple task of multi-digit multiplication. We study why by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings. (1) Evidence of long-range structure: logit attributions and linear probes indicate that the model encodes the long-range dependencies required for multi-digit multiplication. (2) Mechanism: the model encodes these dependencies by using attention to construct a directed acyclic graph that "caches" and "retrieves" pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and it represents digits using a Fourier basis; both are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to learn multi-digit multiplication successfully. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model, we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address it.
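
To make the terms "pairwise partial products" and "running sum" concrete, the following is a minimal sketch of schoolbook multiplication in plain Python. It is an illustration under stated assumptions, not the setup described in the abstract: the function names, the least-significant-digit-first ordering, and the reading of "running sum" as the partial-product total at a position plus the incoming carry are all choices made here for the example.

```python
def digits_lsd_first(n: int) -> list[int]:
    """Return the decimal digits of n, least-significant digit first."""
    if n == 0:
        return [0]
    out = []
    while n > 0:
        out.append(n % 10)
        n //= 10
    return out


def multiply_with_running_sums(a: int, b: int) -> tuple[list[int], list[int]]:
    """Schoolbook multiplication that exposes a per-position running sum.

    Returns (product_digits, running_sums), both least-significant first.
    The "running sum" here is an assumed definition for illustration:
    the sum of partial products at position k plus the carry from k - 1.
    """
    a_digits = digits_lsd_first(a)
    b_digits = digits_lsd_first(b)
    n_out = len(a_digits) + len(b_digits)

    product_digits: list[int] = []
    running_sums: list[int] = []
    carry = 0
    for k in range(n_out):
        # Every pairwise partial product a_i * b_j with i + j == k feeds
        # output digit k, so a single output digit can depend on input
        # digits that are far apart in the sequence.
        partial = sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        running = partial + carry  # assumed "running sum" at position k
        running_sums.append(running)
        product_digits.append(running % 10)
        carry = running // 10  # the carry links position k to k + 1
    return product_digits, running_sums


if __name__ == "__main__":
    digits, sums = multiply_with_running_sums(1234, 5678)
    print("product digits (LSD first):", digits)
    print("running sums per position: ", sums)
```

For 1234 x 5678, the middle output digits each draw on four partial products plus a carry that itself depends on every earlier position, which is the kind of long-range dependency the abstract refers to. An intermediate quantity like this running sum is the sort of target a linear regression probe could be trained to predict, though the exact target used in the work is not specified here.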
