

Efficient Training on Multiple Consumer GPUs with RoundPipe

April 29, 2026
Authors: Yibin Luo, Shiwei Gao, Huichuan Zheng, Youyou Lu, Jiwu Shu
cs.AI

Abstract

Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing pipeline-parallel (PP) schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., the disproportionately large LM head) to GPUs limits the pipeline's throughput to that of the GPU with the heaviest load, leading to severe pipeline bubbles. In this paper, we propose RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round-robin manner, achieving a near-zero-bubble pipeline. To ensure training correctness and system efficiency, RoundPipe integrates a priority-aware transfer scheduling engine, a fine-grained distributed event-based synchronization protocol, and an automated layer partitioning algorithm. Evaluations on an 8× RTX 4090 server demonstrate that RoundPipe achieves 1.48–2.16× speedups over state-of-the-art baselines when fine-tuning 1.7B to 32B models. Remarkably, RoundPipe enables LoRA fine-tuning of the Qwen3-235B model with 31K sequence length on a single server. RoundPipe is publicly available as an open-source Python library with comprehensive documentation.
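To make the round-robin dispatch idea concrete, the sketch below assigns each (microbatch, stage) step to the next GPU in a worker pool instead of binding stages to fixed devices. This is a minimal conceptual illustration only, assuming a static schedule; the function names and data shapes are hypothetical and do not reflect the RoundPipe API.

```python
from collections import deque

def round_robin_schedule(num_stages, num_microbatches, gpus):
    """Conceptual round-robin dispatch: every (microbatch, stage) step is
    handed to the next GPU in the pool, so no single GPU is permanently
    bound to a heavy stage such as the LM head."""
    pool = deque(gpus)
    schedule = []  # list of (gpu, microbatch, stage) assignments
    for mb in range(num_microbatches):
        for stage in range(num_stages):
            gpu = pool[0]
            pool.rotate(-1)  # advance the round-robin pointer
            schedule.append((gpu, mb, stage))
    return schedule

# Example: 4 stages, 2 microbatches, 3 GPUs
sched = round_robin_schedule(4, 2, ["gpu0", "gpu1", "gpu2"])
```

Because stages rotate over the pool, per-GPU step counts differ by at most one even when the stage count is not a multiple of the GPU count — which is the load-balancing property that lets the schedule avoid the bubbles caused by weight binding.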