

StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

June 3, 2025
Authors: Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li
cs.AI

Abstract

Training language models on long sequence data is essential for enhancing a model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost of storing activation values during backpropagation (BP) becomes prohibitive, even with gradient checkpointing. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP requires fewer FLOPs and achieves faster BP by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales the maximum sequence length of BP by 2.8-5.5x, while using comparable or even less BP time. Notably, StreamBP's sequence-length scaling translates directly into batch-size scaling for accelerated training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer model and is available at https://github.com/Ledzy/StreamBP.
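To make the core idea concrete, the sketch below illustrates (for the output layer only, not the authors' full layer-wise method) how a token-level loss that sums over sequence positions lets the chain rule decompose linearly along the sequence dimension: backpropagating one chunk of positions at a time avoids ever materializing the full [seq_len, vocab_size] logits tensor. This is a minimal PyTorch sketch under assumed shapes; names such as `chunked_ce_backward`, `lm_head`, and `chunk_size` are illustrative, not from the paper's code.

```python
# Hedged sketch of sequence-chunked backpropagation through the LM head.
# Assumes hidden: [seq_len, d_model] (batch flattened), labels: [seq_len].
import torch
import torch.nn.functional as F

def chunked_ce_backward(hidden, lm_head, labels, chunk_size=1024):
    """Sum-reduced cross-entropy loss and d(loss)/d(hidden), computed
    chunk-by-chunk so only [chunk_size, vocab_size] logits exist at once."""
    seq_len = hidden.shape[0]
    grad_hidden = torch.zeros_like(hidden)
    total_loss = 0.0
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        # Detach so this chunk is a leaf; its gradient is computed locally.
        h = hidden[start:end].detach().requires_grad_(True)
        logits = lm_head(h)  # only [chunk, vocab] is materialized
        loss = F.cross_entropy(logits, labels[start:end], reduction="sum")
        loss.backward()      # accumulates lm_head weight grads across chunks
        grad_hidden[start:end] = h.grad
        total_loss += loss.item()
    return total_loss, grad_hidden
```

The returned `grad_hidden` can then seed BP through the transformer layers, e.g. `hidden.backward(grad_hidden)`. The paper goes further: it applies the same decomposition inside each layer and exploits causal attention, so a chunk's backward pass only needs keys and values from earlier positions.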