Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
April 29, 2026
Authors: Vijay Sadashivaiah, Georgios Dasoulas, Judith Mueller, Soumya Ghosh
cs.AI
Abstract
Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop-in replacement for softmax attention (a) produces better learned representations, with sigmoid achieving 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss on six diverse single-cell datasets; (b) trains faster, with sigmoid-attention models training up to 10% faster than their softmax counterparts; and (c) trains more stably, by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives (≤ 0.25), unlike softmax, and a diagonal Jacobian structure in contrast with softmax's dense coupling; together these properties help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient clipping on 8K-token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open-source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention-2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at https://github.com/MSDLLCpapers/triton-sigmoid.
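For illustration, below is a minimal PyTorch sketch of the drop-in replacement idea described in the abstract, assuming the common formulation from prior sigmoid-attention work in which the row-wise softmax is replaced by an element-wise sigmoid of the scaled scores with a length-dependent bias. The function names and the −log(n) bias are illustrative assumptions, not the paper's implementation; the exact kernel is in the linked TritonSigmoid repository. The sketch also reflects the stability argument: each sigmoid weight depends only on its own score, so its derivative σ(x)(1 − σ(x)) is bounded by 0.25 and the per-score Jacobian is diagonal, whereas softmax weights are coupled through the row normalization (Jacobian diag(p) − ppᵀ).

```python
import math
import torch

def sigmoid_attention(q, k, v, bias=None):
    """Element-wise sigmoid attention (illustrative sketch, not the paper's exact kernel).

    q, k, v: (batch, heads, seq_len, head_dim).
    Each attention weight is an independent sigmoid of the scaled query-key score,
    shifted by a bias (here assumed to be -log(seq_len), following prior
    sigmoid-attention work); there is no row-wise normalization.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)      # (B, H, L, L)
    if bias is None:
        bias = -math.log(q.size(-2))                     # length-dependent shift (assumption)
    weights = torch.sigmoid(scores + bias)               # element-wise, diagonal Jacobian w.r.t. scores
    return weights @ v

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention, shown for comparison."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    return torch.softmax(scores, dim=-1) @ v             # row normalization couples all weights
```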