Better Models, Faster Training: Sigmoid Attention for Single-Cell Foundation Models

Abstract

Training stable biological foundation models requires rethinking attention mechanisms. We find that using sigmoid attention as a drop-in replacement for softmax attention a) produces better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss; b) trains faster: models with sigmoid attention train up to 10% faster than their softmax counterparts; and c) trains more stably, because it eliminates inherent sources of instability in softmax attention. We establish that sigmoid attention, unlike softmax, has globally bounded derivatives (≤ 0.25) and a diagonal Jacobian structure, in contrast with softmax's dense coupling; together these properties help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient clipping on 8K-token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open-source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention-2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at https://github.com/MSDLLCpapers/triton-sigmoid
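The two structural properties named in the abstract, the 0.25 bound on the sigmoid derivative and the diagonal versus dense Jacobian, can be checked numerically on a single row of attention logits. The following is a minimal pure-Python sketch for illustration only; it is not the TritonSigmoid kernel, and all function names (`sigmoid_jacobian`, `softmax_jacobian`, `sigmoid_attention_row`) and the boolean `keep` padding mask are assumptions introduced here rather than identifiers from the released code.

```python
import math


def sigmoid(x):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_jacobian(row):
    """Jacobian of elementwise sigmoid weights w.r.t. the logits.

    Each weight depends only on its own logit, so the matrix is diagonal,
    with entries s * (1 - s), which never exceed 0.25.
    """
    n = len(row)
    s = [sigmoid(v) for v in row]
    return [[s[i] * (1.0 - s[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def softmax_jacobian(row):
    """Jacobian of softmax weights w.r.t. the logits.

    Entries are p_i * (delta_ij - p_j): every weight depends on every
    logit through the shared normalizer, so the matrix is dense.
    """
    n = len(row)
    p = softmax(row)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]


def sigmoid_attention_row(scores, values, keep):
    """Output for one query under sigmoid attention (illustrative only).

    scores: already-scaled logits for each key.
    values: one value vector (list of floats) per key.
    keep:   hypothetical padding mask; False marks a padded key, which is
            skipped so it contributes exactly zero weight.
    """
    dim = len(values[0])
    out = [0.0] * dim
    for score, value, is_real in zip(scores, values, keep):
        if not is_real:
            continue
        weight = sigmoid(score)
        for d in range(dim):
            out[d] += weight * value[d]
    return out


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]
    print("sigmoid Jacobian:", sigmoid_jacobian(logits))
    print("softmax Jacobian:", softmax_jacobian(logits))
    print("padded row output:",
          sigmoid_attention_row([0.0, 0.0], [[2.0], [4.0]], [True, False]))
```

Because sigmoid weights are not normalized across keys, dropping a padded key in this sketch does not change the weights of the remaining keys, whereas under softmax the mask would have to enter the normalizer. The sketch only illustrates the structure of the two Jacobians; it does not reproduce the paper's stability analysis or its performance results.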
