When Do Transformers Learn Heuristics for Graph Connectivity?

Abstract

Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an L-layer model has the capacity to solve graphs with diameters up to exactly 3^L, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training dynamics and show that the learned strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter ≤ 3^L) drive the learning of a correct algorithmic solution, while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data to within a model's capacity leads both standard and disentangled Transformers to learn the exact algorithm rather than the degree-based heuristic.
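The capacity bound can be made concrete with a small sketch of the matrix-power view of connectivity. This is only an illustration, not the paper's Transformer construction: it assumes that each "layer" cubes a boolean reachability matrix, so the hop budget triples per layer and reaches 3^L after L layers. The names `reach_within` and `path_graph` are hypothetical helpers introduced here.

```python
def reach_within(adj, num_layers):
    """Return boolean reachability within 3**num_layers hops.

    Assumption for illustration: one "layer" cubes the reachability
    matrix, which triples the number of hops it covers.
    """
    n = len(adj)
    # Start with one-hop reachability; self-loops keep shorter paths valid.
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]

    def bool_matmul(a, b):
        # Boolean matrix product: (i, j) is True if some k links i to j.
        return [
            [any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]

    for _ in range(num_layers):
        # Cubing turns "within h hops" into "within 3h hops".
        squared = bool_matmul(reach, reach)
        reach = bool_matmul(squared, reach)
    return reach


def path_graph(n):
    """Adjacency matrix of an undirected path on n nodes (diameter n - 1)."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1
    return adj


if __name__ == "__main__":
    g = path_graph(10)  # diameter 9 = 3**2
    print(reach_within(g, 2)[0][9])  # endpoints are 9 hops apart
    print(reach_within(g, 1)[0][9])  # only 3 hops are covered with one layer
```

Under this assumption, a path on 10 nodes (diameter 9) is fully resolved with two layers, while a single layer only connects nodes at most 3 hops apart. That gap mirrors the within-capacity versus beyond-capacity distinction the abstract draws.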
