Horizon-LM: A RAM-Centric Architecture for LLM Training

Abstract

The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5 TB of host RAM, Horizon-LM reliably trains models of up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2× higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
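The execution model summarized in the abstract can be illustrated with a toy sketch. The code below is not Horizon-LM's implementation: it assumes a chain of scalar layers y = tanh(w * x), uses plain Python lists in place of host memory, and uses a background thread in place of an asynchronous host-to-device copy. The names `HostStore`, `stream_layers`, and `forward_backward` are hypothetical. It shows three ideas only: parameters live in a host-side store, weights are streamed through two alternating buffer slots while the next one is prefetched, and the backward pass recomputes each layer and propagates gradients by hand instead of keeping an autograd graph.

```python
from concurrent.futures import ThreadPoolExecutor
import math


class HostStore:
    """Hypothetical host-side store: the only persistent copy of weights and grads."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.grads = [0.0] * len(self.weights)


def stream_layers(host, order, pool):
    """Yield (layer index, weight) in `order`, using two alternating buffer slots.

    While the caller computes on the current slot, the next weight is loaded
    into the other slot by a background thread (a stand-in for an async copy).
    """
    slots = [None, None]

    def load(slot, idx):
        slots[slot] = host.weights[idx]

    pending = pool.submit(load, 0, order[0]) if order else None
    for step, idx in enumerate(order):
        pending.result()  # make sure the current slot is filled
        if step + 1 < len(order):
            pending = pool.submit(load, (step + 1) % 2, order[step + 1])
        yield idx, slots[step % 2]


def forward_backward(host, x):
    """One forward/backward pass; writes gradients into host.grads, returns the loss."""
    n = len(host.weights)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Forward: keep only each layer's input, not its intermediate results.
        inputs = []
        for idx, w in stream_layers(host, list(range(n)), pool):
            inputs.append(x)
            x = math.tanh(w * x)
        loss = 0.5 * x * x
        grad = x  # d(loss)/d(final output)

        # Backward: stream layers in reverse, recompute, apply the chain rule by hand.
        for idx, w in stream_layers(host, list(range(n - 1, -1, -1)), pool):
            h = math.tanh(w * inputs[idx])        # explicit recomputation
            grad_pre = grad * (1.0 - h * h)       # through tanh
            host.grads[idx] = grad_pre * inputs[idx]
            grad = grad_pre * w                   # gradient for the previous layer
    return loss


host = HostStore([0.5, -0.3, 0.8])
loss = forward_backward(host, 1.2)
```

In this sketch at most two weights are resident in the "device" slots at any time, regardless of the number of layers, which is the property that lets the resident footprint stay independent of model depth. A real system would replace the thread with pinned-memory transfers on a separate stream and the scalar layers with transformer blocks; those details are outside what the abstract specifies.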

