Fast and Simplex: 2-Simplicial Attention in Triton

Abstract

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot-product attention.
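To make the phrase "generalizes standard dot-product attention to trilinear functions" concrete, here is a minimal pure-Python sketch. The abstract does not spell out the formula, so this assumes one common formulation: each query scores a pair of keys through a three-way product, the softmax runs jointly over all key pairs, and the output combines elementwise products of two value vectors. The function names, the second key/value projections (`k2`, `v2`), and the 1/sqrt(d) scaling are illustrative assumptions, not the paper's Triton kernel.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a flat list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(q, k, v):
    # Standard attention: bilinear logits q_i . k_j, softmax over j.
    d = len(q[0])
    out = []
    for qi in q:
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(logits)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def two_simplicial_attention(q, k1, k2, v1, v2):
    # Assumed formulation: trilinear logits sum_c q_i[c] * k1_j[c] * k2_k[c],
    # softmax taken jointly over all (j, k) pairs.
    d = len(q[0])
    # Value pairs are enumerated in the same (j outer, k inner) order as the logits.
    pairs = [(a, b) for a in v1 for b in v2]
    out = []
    for qi in q:
        logits = [sum(qi[c] * a[c] * b[c] for c in range(d)) / math.sqrt(d)
                  for a in k1 for b in k2]
        w = softmax(logits)
        # Each key pair contributes the elementwise product of its two values.
        out.append([sum(wi * a[c] * b[c] for wi, (a, b) in zip(w, pairs))
                    for c in range(len(v1[0]))])
    return out
```

Under these assumptions, setting every entry of `k2` and `v2` to 1 collapses the trilinear form back to ordinary dot-product attention, which is one way to read "generalizes". Note that the naive version above costs on the order of n^3 work per head for sequence length n, which is why an efficient kernel matters.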

