Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models

April 29, 2026
Authors: Vijay Sadashivaiah, Georgios Dasoulas, Judith Mueller, Soumya Ghosh
cs.AI

Abstract

Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop-in replacement for softmax attention a) produces better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss; b) trains faster: models with sigmoid attention train up to 10% faster than their softmax counterparts; and c) trains more stably, by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives (≤ 0.25), unlike softmax, and a diagonal Jacobian structure in contrast with softmax's dense coupling; together, these properties help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient clipping on 8K-token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open-source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention-2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at https://github.com/MSDLLCpapers/triton-sigmoid
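
To make the drop-in claim concrete, here is a minimal PyTorch sketch contrasting standard softmax attention with an element-wise sigmoid variant. The function names and the additive bias value are illustrative assumptions; this is not the TritonSigmoid kernel released by the authors.

```python
# Minimal sketch, assuming a standard scaled dot-product setup.
# Not the authors' TritonSigmoid kernel; bias value is an assumption.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: the softmax normalizes each row of scores,
    # coupling all weights in a row (dense Jacobian w.r.t. the scores).
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    return F.softmax(scores, dim=-1) @ v

def sigmoid_attention(q, k, v, bias=-10.0):
    # Drop-in variant: each score is squashed independently by a sigmoid,
    # so the Jacobian of weights w.r.t. scores is diagonal and every
    # derivative is bounded by sigma'(x) = sigma(x)(1 - sigma(x)) <= 0.25.
    # The additive bias (keeping initial attention mass small) is an
    # illustrative choice, not a value taken from the paper.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    return torch.sigmoid(scores + bias) @ v

# Toy usage: batch of 2, 4 heads, 128 tokens, head dimension 64.
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)
out = sigmoid_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```

Because the sigmoid is applied element-wise, each attention weight depends only on its own score; that is the diagonal-Jacobian property the abstract contrasts with softmax's row-wise normalization.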