Better Models, Faster Training: Sigmoid Attention for Single-Cell Foundation Models

Abstract

Training stable biological foundation models requires rethinking attention mechanisms. We find that using sigmoid attention as a drop-in replacement for softmax attention yields a) better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss; b) faster training: models with sigmoid attention train up to 10% faster than their softmax counterparts; and c) more stable training, by eliminating sources of instability inherent to softmax attention. We establish that sigmoid attention has globally bounded derivatives (≤ 0.25), unlike softmax, and a diagonal Jacobian structure, in contrast with softmax's dense coupling; together, these properties help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient clipping on 8K-token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open-source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention-2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at https://github.com/MSDLLCpapers/triton-sigmoid.
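The two structural claims in the abstract can be checked numerically on a single row of attention scores. For softmax with output p, the Jacobian with respect to the scores is diag(p) - p pᵀ, so every score influences every weight; for an elementwise sigmoid with output s, it is diag(s(1 - s)), which is diagonal with entries of at most 0.25. The NumPy sketch below illustrates this, along with one plausible way to handle padding. It is not the TritonSigmoid kernel: the function names, the `bias` parameter, and the mask convention (True marks a real token) are assumptions made for this example.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax_jacobian(z):
    # d softmax(z)_i / d z_j = p_i * (delta_ij - p_j): dense coupling.
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)


def sigmoid_jacobian(z):
    # d sigmoid(z)_i / d z_j = s_i * (1 - s_i) * delta_ij: diagonal, each entry <= 0.25.
    s = sigmoid(z)
    return np.diag(s * (1.0 - s))


def attention(q, k, v, kind="sigmoid", pad_mask=None, bias=0.0):
    """Single-head bidirectional attention on (n, d) arrays.

    pad_mask: optional bool array of shape (n,), True for real tokens (assumed convention).
    bias: hypothetical scalar added to the scores before the sigmoid.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if kind == "sigmoid":
        weights = sigmoid(scores + bias)
        if pad_mask is not None:
            # No normalization across keys, so padded keys can simply be zeroed.
            weights = weights * pad_mask[None, :]
    else:
        if pad_mask is not None:
            # Softmax needs padded keys excluded before normalization.
            scores = np.where(pad_mask[None, :], scores, -np.inf)
        weights = softmax(scores)
    return weights @ v
```

Because sigmoid weights are computed independently per key, masking a padded key only removes that key's contribution, whereas masking under softmax changes the normalizer shared by every other key in the row.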
