Horizon-LM: A RAM-Centric Architecture for LLM Training

Abstract

The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5 TB of host RAM, Horizon-LM reliably trains models of up to 120B parameters. On a standard single-A100 machine, Horizon-LM achieves up to 12.2× higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
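The double-buffered streaming idea mentioned in the abstract can be illustrated with a small CPU-only toy. This is a sketch under stated assumptions, not Horizon-LM's implementation: `host_store`, `load_layer`, `apply_layer`, and `forward_double_buffered` are hypothetical names, a background thread stands in for the host-to-device copy, and a scalar affine step stands in for a GPU kernel. It shows only the scheduling pattern: the next layer's weights are fetched while the current layer computes, so at most two layer buffers are live at once regardless of model depth.

```python
from concurrent.futures import ThreadPoolExecutor

def load_layer(host_store, i):
    """Hypothetical stand-in for a host-to-device copy of layer i's weights."""
    return tuple(host_store[i])

def apply_layer(weights, x):
    """Hypothetical stand-in for a GPU kernel: toy affine step y = w0 * x + w1."""
    w0, w1 = weights
    return w0 * x + w1

def forward_double_buffered(host_store, x):
    """Run layers in order while prefetching the next layer in the background.

    Only two buffers exist at any time: the layer being computed on and the
    layer being fetched. The full parameter set stays in host_store.
    """
    if not host_store:
        return x
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(load_layer, host_store, 0)
        for i in range(len(host_store)):
            current = pending.result()  # wait until layer i has arrived
            if i + 1 < len(host_store):
                # Start fetching layer i + 1 before computing on layer i.
                pending = copier.submit(load_layer, host_store, i + 1)
            x = apply_layer(current, x)  # overlaps with the fetch above
    return x

def forward_sequential(host_store, x):
    """Reference path without prefetching, used to check the result."""
    for weights in host_store:
        x = apply_layer(weights, x)
    return x

# Toy "model": three layers of (scale, shift) pairs held in host memory.
host_store = [(2.0, 1.0), (0.5, -3.0), (1.5, 0.25)]
result = forward_double_buffered(host_store, 4.0)
reference = forward_sequential(host_store, 4.0)
```

In this toy, `result` and `reference` agree because prefetching changes only when weights are fetched, not the order in which layers are applied. A real system would replace the thread with asynchronous device copies and would also need a backward pass with recomputation, which this sketch omits.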

