

Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

November 21, 2025
Authors: Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge
cs.AI

Abstract

We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, using MI300X GPUs together with the Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first such characterization at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks, and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-overlooked utilities such as fault tolerance and checkpoint reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model, ZAYA1-base (an MoE with 760M active and 8.3B total parameters), which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
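
As a rough sketch of the kind of collective microbenchmark described above, the snippet below times all-reduce across a sweep of message sizes with torch.distributed (the NCCL backend maps to RCCL on ROCm builds of PyTorch). The script layout, message sizes, and launch command are illustrative assumptions, not the authors' actual harness.

```python
# Minimal sketch of an all-reduce microbenchmark (NOT the paper's harness):
# sweeps message sizes and reports ring bus bandwidth, in the spirit of the
# Pollara collective characterization described in the abstract.
import os
import time

import torch
import torch.distributed as dist


def benchmark_allreduce(sizes_mib, warmup=5, iters=20):
    """Time all-reduce for each message size and print bus bandwidth in GB/s."""
    world = dist.get_world_size()
    rank = dist.get_rank()
    device = torch.device("cuda", torch.cuda.current_device())
    for size_mib in sizes_mib:
        numel = size_mib * (1 << 20) // 2          # bf16 -> 2 bytes per element
        buf = torch.randn(numel, dtype=torch.bfloat16, device=device)
        for _ in range(warmup):                    # warm up the communicator
            dist.all_reduce(buf)
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(buf)
        torch.cuda.synchronize(device)
        elapsed = (time.perf_counter() - start) / iters
        nbytes = buf.numel() * buf.element_size()
        # Ring all-reduce moves 2*(N-1)/N of the buffer per rank: report "bus" bandwidth.
        bus_bw_gbps = nbytes * 2 * (world - 1) / world / elapsed / 1e9
        if rank == 0:
            print(f"all-reduce {size_mib:6d} MiB: {bus_bw_gbps:7.1f} GB/s")


if __name__ == "__main__":
    # Example launch (assumed): torchrun --nproc_per_node=8 allreduce_bench.py
    dist.init_process_group(backend="nccl")        # uses RCCL on ROCm PyTorch
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    benchmark_allreduce([1, 16, 256, 1024])
    dist.destroy_process_group()
```

The same loop structure can be repeated for reduce-scatter, all-gather, and broadcast by swapping the collective call and adjusting the bandwidth correction factor for each algorithm.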