Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Abstract

The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.