Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

September 30, 2025
Authors: Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viégas, Martin Wattenberg, Andrew Lee
cs.AI

Abstract

Language models are increasingly capable, yet still fail at the seemingly simple task of multi-digit multiplication. In this work, we study why by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the long-range dependencies required for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to "cache" and "retrieve" pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuned model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model, we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
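To make the proposed auxiliary objective concrete, below is a minimal PyTorch sketch of a linear regression probe on hidden states whose squared error against a per-position "running sum" target is added to the language-modeling loss. The names (RunningSumProbe, answer_mask, aux_weight), the choice of which layer's hidden states to probe, and how the running-sum targets are computed are illustrative assumptions, not details taken from the paper's implementation.

```python
import torch
import torch.nn as nn


class RunningSumProbe(nn.Module):
    """Linear regression probe that reads a scalar 'running sum' off hidden states."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) -> predictions: (batch, seq_len)
        return self.proj(hidden_states).squeeze(-1)


def auxiliary_running_sum_loss(hidden_states, targets, mask, probe):
    """Mean squared error between the probe's prediction and the running-sum
    targets, averaged over the positions selected by `mask` (e.g. answer digits)."""
    pred = probe(hidden_states)                     # (batch, seq_len)
    sq_err = (pred - targets) ** 2
    return (sq_err * mask).sum() / mask.sum().clamp(min=1)


# Example of combining with the usual language-modeling loss:
# total_loss = lm_loss + aux_weight * auxiliary_running_sum_loss(
#     hidden_states, running_sum_targets, answer_mask, probe)
```

Because the probe's gradient flows back into the Transformer's hidden states, the auxiliary term biases the model toward explicitly maintaining the long-range running sum; the loss weight and the probed layer are left as hyperparameters in this sketch.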