StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs
June 3, 2025
Authors: Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li
cs.AI
Abstract
Training language models on long sequence data is a demanding requirement for
enhancing the model's capability on complex tasks, e.g., long-chain reasoning.
However, as the sequence length scales up, the memory cost of storing
activation values during backpropagation (BP) becomes huge, even when
gradient checkpointing is applied. To tackle this
challenge, we propose a memory-efficient and exact BP method called StreamBP,
which performs a linear decomposition of the chain rule along the sequence
dimension in a layer-wise manner, significantly reducing the memory cost of
activation values and logits. The proposed method is applicable to common
objectives such as SFT, GRPO, and DPO. From an implementation perspective,
StreamBP requires fewer computational FLOPs and achieves faster BP by leveraging
the causal structure of the language model. Compared to gradient checkpointing,
StreamBP scales up the maximum sequence length of BP by a factor of 2.8-5.5,
while using comparable or even less BP time. Note that StreamBP's sequence
length scaling ability can be directly transferred to batch size scaling for
accelerating training. We further develop a communication-efficient distributed
StreamBP to effectively support multi-GPU training and broaden its
applicability. Our code can be easily integrated into the training pipeline of
any transformer model and is available at https://github.com/Ledzy/StreamBP.
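
The decomposition along the sequence dimension can be illustrated on the final language-model head alone. The sketch below is not the authors' implementation: it is a pure-Python toy that assumes a single linear head with a summed cross-entropy loss, and the names `backward_span`, `full_backward`, and `streamed_backward` are hypothetical. Because the loss is a sum over token positions, the gradient with respect to the head weights is the sum of per-chunk gradients, so only one chunk of logits needs to be held at a time while the result stays exact. The layer-wise treatment of transformer blocks and the savings from causal attention are not covered here.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def backward_span(W, hidden, targets, start, stop, grad_W):
    """Add d(loss)/dW for positions [start, stop) into grad_W in place.

    Only the logits of this span are materialized. Returns the span's loss
    and the number of logit values that were held at once.
    """
    logits = [[sum(w * x for w, x in zip(row, hidden[t])) for row in W]
              for t in range(start, stop)]
    loss = 0.0
    for t, row in zip(range(start, stop), logits):
        probs = softmax(row)
        loss -= math.log(probs[targets[t]])
        for v, p in enumerate(probs):
            # Gradient of cross-entropy w.r.t. logit v is (p_v - onehot_v).
            delta = p - (1.0 if v == targets[t] else 0.0)
            for j, x in enumerate(hidden[t]):
                grad_W[v][j] += delta * x
    return loss, len(logits) * len(W)


def full_backward(W, hidden, targets):
    """Baseline: materialize the logits of the whole sequence at once."""
    grad = [[0.0] * len(W[0]) for _ in W]
    loss, peak = backward_span(W, hidden, targets, 0, len(hidden), grad)
    return loss, grad, peak


def streamed_backward(W, hidden, targets, chunk):
    """Process the sequence chunk by chunk, accumulating the exact gradient."""
    grad = [[0.0] * len(W[0]) for _ in W]
    loss, peak = 0.0, 0
    for start in range(0, len(hidden), chunk):
        stop = min(start + chunk, len(hidden))
        span_loss, held = backward_span(W, hidden, targets, start, stop, grad)
        loss += span_loss
        peak = max(peak, held)
    return loss, grad, peak


# Toy sizes (assumptions for illustration): 12 positions, hidden size 4, vocab 5.
random.seed(0)
seq_len, dim, vocab = 12, 4, 5
W = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
hidden = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
targets = [random.randrange(vocab) for _ in range(seq_len)]

loss_full, grad_full, peak_full = full_backward(W, hidden, targets)
loss_str, grad_str, peak_str = streamed_backward(W, hidden, targets, chunk=3)

max_diff = max(abs(a - b) for ra, rb in zip(grad_full, grad_str)
               for a, b in zip(ra, rb))
print("max gradient difference:", max_diff)
print("peak logits held (full vs streamed):", peak_full, peak_str)
```

In this toy setting the streamed pass holds `chunk * vocab` logit values at a time instead of `seq_len * vocab`, and the accumulated gradient matches the full pass. A real training pipeline would rely on an autograd framework rather than hand-written gradients.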