When Do Transformers Learn Heuristics for Graph Connectivity?
October 22, 2025
Authors: Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan
cs.AI
Abstract
Transformers often fail to learn generalizable algorithms, instead relying on
brittle heuristics. Using graph connectivity as a testbed, we explain this
phenomenon both theoretically and empirically. We consider a simplified
Transformer architecture, the disentangled Transformer, and prove that an
L-layer model has the capacity to solve connectivity for graphs with diameter
up to exactly 3^L, implementing an algorithm equivalent to computing powers of
the adjacency matrix. We analyze the training dynamics and show that the
learned strategy hinges on whether most training instances fall within this
capacity. Within-capacity graphs (diameter ≤ 3^L) drive the learning of the
correct algorithmic solution, while beyond-capacity graphs drive the learning of
a simple heuristic based on node degrees. Finally, we empirically demonstrate
that restricting the training data to within a model's capacity leads both
standard and disentangled Transformers to learn the exact algorithm rather than
the degree-based heuristic.
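
As a concrete illustration of the two strategies the abstract contrasts, here is a minimal Python sketch (not from the paper; the function names and the degree threshold are hypothetical). It checks s-t connectivity by repeatedly cubing a boolean reachability matrix, so L cubings cover all paths of length up to 3^L, mirroring the stated capacity result, and sets this against a brittle degree-based guess.

```python
import numpy as np

def connected_within_diameter(adj: np.ndarray, s: int, t: int, L: int) -> bool:
    """Decide s-t connectivity via powers of the adjacency matrix.

    Start from R = A + I (one-step reachability plus self-loops); each
    "layer" replaces R with its boolean cube R^3, so after L layers R
    records all paths of length up to 3^L. Illustrative sketch only.
    """
    n = adj.shape[0]
    reach = ((adj + np.eye(n, dtype=int)) > 0).astype(int)
    for _ in range(L):
        # Boolean matrix cube: one layer triples the covered path length.
        reach = ((reach @ reach @ reach) > 0).astype(int)
    return bool(reach[s, t])

def degree_heuristic(adj: np.ndarray, s: int, t: int, threshold: int = 2) -> bool:
    """Hypothetical stand-in for the degree-based shortcut: guess
    "connected" whenever both endpoints have high degree."""
    deg = adj.sum(axis=1)
    return bool(deg[s] >= threshold and deg[t] >= threshold)

# Usage: a path graph 0-1-2-3 has diameter 3, so L = 1 already suffices.
A = np.zeros((4, 4), dtype=int)
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(connected_within_diameter(A, 0, 3, L=1))  # True: diameter 3 <= 3^1
print(degree_heuristic(A, 0, 3))                # False: both endpoints have degree 1
```

The last two lines show why the heuristic is brittle: it mislabels connected low-degree endpoints that the matrix-power algorithm handles correctly.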