Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
September 30, 2025
Authors: Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viégas, Martin Wattenberg, Andrew Lee
cs.AI
Abstract
Language models are increasingly capable, yet they still fail at the seemingly simple task of multi-digit multiplication. In this work, we study why by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and we report three findings. (1) Evidence of long-range structure: logit attributions and linear probes indicate that the model encodes the long-range dependencies required for multi-digit multiplication. (2) Mechanism: the model encodes these long-range dependencies by using attention to construct a directed acyclic graph that "caches" and "retrieves" pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and it represents digits in a Fourier basis; both are intuitive and efficient representations that a model trained with standard fine-tuning lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum lacking the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model, we uncover a pitfall for learning long-range dependencies in Transformers and show how the right inductive bias can address it.
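To make the long-range structure concrete, here is a minimal Python sketch of how multi-digit multiplication decomposes into pairwise partial products accumulated into a carried running sum. The function name and the least-significant-first digit convention are illustrative choices, not details from the paper, and "running sum" here is our reading of the term in the abstract. The point is that, through the carry, each output digit depends on all earlier partial products, which is the long-range dependency at issue.

```python
def multiply_via_partial_products(a_digits, b_digits):
    """Long multiplication as a carried running sum of pairwise partial products.

    Digits are given least-significant first, e.g. 12 -> [2, 1].
    """
    n = len(a_digits) + len(b_digits)
    out, carry = [], 0
    for k in range(n):
        # Sum every pairwise partial product a_i * b_j with i + j == k.
        s = carry + sum(a_digits[i] * b_digits[k - i]
                        for i in range(len(a_digits))
                        if 0 <= k - i < len(b_digits))
        out.append(s % 10)   # k-th output digit
        carry = s // 10      # long-range dependency: the carry feeds all later digits
    return out

assert multiply_via_partial_products([2, 1], [4, 3]) == [8, 0, 4, 0]  # 12 * 34 = 408
```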
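The geometric finding can be illustrated in the same spirit. Below is a minimal sketch, assuming a standard Fourier feature map over digits; the function name and the choice of three frequencies are ours, not the paper's. Summing the embeddings of two digits yields one vector per pair, and the set of all such vectors is, by definition, the Minkowski sum of the two digit-embedding sets.

```python
import numpy as np

def fourier_digit_embedding(d, num_freqs=3, base=10):
    """Embed a digit 0..9 as cos/sin features at a few frequencies."""
    ks = np.arange(1, num_freqs + 1)
    angles = 2 * np.pi * ks * d / base
    return np.concatenate([np.cos(angles), np.sin(angles)])

digits = range(10)
emb = {d: fourier_digit_embedding(d) for d in digits}

# Adding the embeddings of two digits gives one vector per (a, b) pair;
# the set of all 100 such vectors is the Minkowski sum of the two
# digit-embedding sets, mirroring how the attention heads are described
# as combining digit pairs.
pair_reps = {(a, b): emb[a] + emb[b] for a in digits for b in digits}
```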
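Finally, the auxiliary loss can be sketched as a linear regression probe on hidden states whose error is added to the language-modeling loss. The class and variable names below are hypothetical; the abstract specifies only that a linear regression probe predicts the "running sum".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RunningSumProbe(nn.Module):
    """Linear regression probe: predict a scalar running sum per position."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states, running_sums):
        # hidden_states: (batch, seq, hidden); running_sums: (batch, seq)
        preds = self.linear(hidden_states).squeeze(-1)
        return F.mse_loss(preds, running_sums)

# Sketch of a training step combining the two objectives:
#   probe = RunningSumProbe(hidden_dim=model.config.hidden_size)
#   total_loss = lm_loss + aux_weight * probe(hidden_states, running_sum_targets)
```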