Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

September 30, 2025
Authors: Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viégas, Martin Wattenberg, Andrew Lee
cs.AI

Abstract

Language models are increasingly capable, yet still fail at the seemingly simple task of multi-digit multiplication. In this work, we study why by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the long-range dependencies required for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to "cache" and "retrieve" pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuned model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model, we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
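To make the proposed auxiliary objective concrete, below is a minimal PyTorch sketch of a linear regression probe on hidden states whose squared error against a per-position "running sum" target is added to the language-modeling loss. The names (RunningSumProbe, answer_mask, aux_weight), the choice of which layer's hidden states to probe, and how the running-sum targets are computed are illustrative assumptions, not details taken from the paper's implementation.

```python
import torch
import torch.nn as nn


class RunningSumProbe(nn.Module):
    """Linear regression probe that reads a scalar 'running sum' off hidden states."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) -> predictions: (batch, seq_len)
        return self.proj(hidden_states).squeeze(-1)


def auxiliary_running_sum_loss(hidden_states, targets, mask, probe):
    """Mean squared error between the probe's prediction and the running-sum
    targets, averaged over the positions selected by `mask` (e.g. answer digits)."""
    pred = probe(hidden_states)                     # (batch, seq_len)
    sq_err = (pred - targets) ** 2
    return (sq_err * mask).sum() / mask.sum().clamp(min=1)


# Example of combining with the usual language-modeling loss:
# total_loss = lm_loss + aux_weight * auxiliary_running_sum_loss(
#     hidden_states, running_sum_targets, answer_mask, probe)
```

Because the probe's gradient flows back into the Transformer's hidden states, the auxiliary term biases the model toward explicitly maintaining the long-range running sum; the loss weight and the probed layer are left as hyperparameters in this sketch.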