
Fast and Simplex: 2-Simplicial Attention in Triton

July 3, 2025
作者: Aurko Roy, Timothy Chou, Sai Surya Duvvuri, Sijia Chen, Jiecao Yu, Xiaodong Wang, Manzil Zaheer, Rohan Anil
cs.AI

Abstract

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot product attention.
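The core construction described in the abstract is a trilinear generalization of attention: instead of bilinear logits ⟨q_i, k_j⟩, 2-simplicial attention scores triples of positions. The following is a minimal dense PyTorch sketch of that idea, not the paper's Triton kernel: the two key/value streams (k1, k2, v1, v2), the 1/d scaling, and the elementwise-product aggregation of the two value streams are illustrative assumptions rather than the authors' exact formulation.

import torch

def dot_product_attention(q, k, v):
    # Standard bilinear attention: logits[i, j] = <q_i, k_j> / sqrt(d)
    logits = torch.einsum("id,jd->ij", q, k) / q.shape[-1] ** 0.5
    return torch.softmax(logits, dim=-1) @ v

def two_simplicial_attention(q, k1, k2, v1, v2):
    # Trilinear logits: logits[i, j, l] = sum_d q[i,d] * k1[j,d] * k2[l,d]
    # (1/d scaling is an assumed analogue of the usual 1/sqrt(d) factor)
    logits = torch.einsum("id,jd,ld->ijl", q, k1, k2) / q.shape[-1]
    n = q.shape[0]
    # Softmax jointly over all (j, l) key pairs for each query position i
    weights = torch.softmax(logits.reshape(n, -1), dim=-1).reshape_as(logits)
    # Aggregate values as an elementwise product of the two value streams;
    # this combination rule is an assumption for the sketch.
    return torch.einsum("ijl,jd,ld->id", weights, v1, v2)

# Example: sequence length 8, head dimension 16
q, k1, k2 = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
v1, v2 = torch.randn(8, 16), torch.randn(8, 16)
out = two_simplicial_attention(q, k1, k2, v1, v2)  # shape (8, 16)

Note that the dense formulation above costs O(n^3 d) time and memory, versus O(n^2 d) for dot-product attention, which is why the paper relies on an efficient Triton kernel rather than a naive implementation like this one.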