Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
November 21, 2025
Authors: Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge
cs.AI
Abstract
We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and the Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first such characterization at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks, and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-overlooked utilities such as fault tolerance and checkpoint reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model, ZAYA1 (an MoE with 760M active and 8.3B total parameters), which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
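To make concrete what a collective microbenchmark of the kind described above looks like, the sketch below times all-reduce across a few message sizes with PyTorch distributed and reports per-iteration latency and approximate bus bandwidth. This is an illustrative minimal example, not the paper's actual harness: the launch method (torchrun), backend name ("nccl", which maps to RCCL on ROCm), warmup/iteration counts, and message sizes are all assumptions.

```python
# Minimal sketch (not from the paper): timing all-reduce across message sizes
# with torch.distributed, in the spirit of the collective microbenchmarks
# described in the abstract. Launch with: torchrun --nproc_per_node=<N> bench.py
import os
import time

import torch
import torch.distributed as dist


def benchmark_all_reduce(message_bytes, warmup=5, iters=20):
    """Return average seconds per all-reduce on a bf16 tensor of message_bytes."""
    numel = message_bytes // 2  # bf16 = 2 bytes per element
    x = torch.ones(numel, dtype=torch.bfloat16, device="cuda")  # ROCm reuses the "cuda" device alias
    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


def main():
    dist.init_process_group(backend="nccl")  # RCCL is exposed via the NCCL backend on ROCm
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    for size in (1 << 20, 1 << 24, 1 << 28):  # 1 MiB, 16 MiB, 256 MiB messages
        t = benchmark_all_reduce(size)
        if dist.get_rank() == 0:
            # Standard all-reduce bus-bandwidth estimate: 2 * (n - 1) / n * bytes / time
            n = dist.get_world_size()
            busbw = 2 * (n - 1) / n * size / t / 1e9
            print(f"{size >> 20} MiB: {t * 1e6:.1f} us/iter, ~{busbw:.1f} GB/s bus bw")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same loop structure applies to reduce-scatter, all-gather, and broadcast by swapping the collective call and adjusting the bandwidth formula accordingly.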