Horizon-LM: A RAM-Centric Architecture for LLM Training
February 4, 2026
Authors: Zhengqing Yuan, Lichao Sun, Yanfang Ye
cs.AI
Abstract
The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-subordinate execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5 TB of host RAM, Horizon-LM reliably trains models with up to 120B parameters. On a standard single-A100 machine, Horizon-LM achieves up to 12.2× higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
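The core execution model the abstract describes (parameters resident in host RAM, the GPU used only as a transient per-layer compute engine, no autograd graph, activations recomputed during a manual backward pass) can be illustrated with a minimal numpy sketch. All names here (`HostStore`, `fetch`, `step`) are hypothetical, chosen for illustration; this is not Horizon-LM's actual API, and the host-to-device transfer is stood in for by an array copy.

```python
import numpy as np

rng = np.random.default_rng(0)

class HostStore:
    """Authoritative parameter store kept in host RAM (here: a plain list)."""
    def __init__(self, dims):
        self.layers = [rng.standard_normal((d_in, d_out)) * 0.1
                       for d_in, d_out in zip(dims, dims[1:])]

def fetch(store, i):
    # Stand-in for a host-to-device copy of one layer's weights;
    # only one layer is ever "resident" on the device at a time.
    return store.layers[i].copy()

def step(store, x, y, lr=0.05):
    # Forward: stream layers one at a time; checkpoint only each layer's
    # input for later recomputation -- no autograd graph is retained.
    inputs, h = [], x
    for i in range(len(store.layers)):
        inputs.append(h)                 # checkpoint layer input
        W = fetch(store, i)              # transient device copy
        h = np.maximum(h @ W, 0.0)       # linear + ReLU
    loss = 0.5 * np.mean((h - y) ** 2)

    # Backward: walk layers in reverse, recompute each pre-activation,
    # propagate gradients manually, and write updates back to host RAM.
    grad = (h - y) / y.size
    for i in reversed(range(len(store.layers))):
        W = fetch(store, i)
        z = inputs[i] @ W                # recompute pre-activation
        grad = grad * (z > 0)            # ReLU backward
        gW = inputs[i].T @ grad          # weight gradient
        grad = grad @ W.T                # propagate to previous layer
        store.layers[i] -= lr * gW       # update lands in the host store
    return loss

store = HostStore([8, 16, 4])
x = rng.standard_normal((32, 8))
y = rng.standard_normal((32, 4))
losses = [step(store, x, y) for _ in range(50)]
assert losses[-1] < losses[0]            # manual backprop reduces the loss
```

The sketch omits the pipelined double buffering the abstract also mentions; in that scheme the `fetch` of layer `i+1` would overlap with the compute of layer `i` on a separate transfer stream, so device memory holds at most two transient layer buffers.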