Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers
February 4, 2026
Authors: Liangyu Wang, Siqi Zhang, Junjie Wang, Yiming Dong, Bo Zheng, Zihan Qiu, Shengkun Tang, Di Wang, Rui Men, Dayiheng Liu
cs.AI
Abstract
The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing the load imbalance. For Tensor Parallelism, we design an Asynchronous Compute pipeline utilizing Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. Extensive evaluations on the Qwen3 model family (up to 32B parameters) on 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57x speedup in end-to-end iteration time and reducing optimizer step latency by 5.8x compared to the baseline.
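The abstract does not spell out the α-Balanced Static Partitioning algorithm, but its stated goals (keep each parameter matrix atomic, balance load across data-parallel ranks) suggest a cost-weighted assignment problem. The sketch below is a hypothetical illustration, not the paper's method: it uses a classic largest-first greedy heuristic, and the `alpha` exponent on the cost model is an assumed knob invented here for illustration.

```python
def alpha_balanced_partition(shapes, num_ranks, alpha=1.0):
    """Assign whole parameter matrices to data-parallel ranks.

    Each matrix is placed atomically (never sharded) on the rank with
    the lowest accumulated cost so far, approximating load balance.

    shapes    : list of (rows, cols) parameter-matrix shapes
    num_ranks : number of data-parallel workers
    alpha     : assumed cost-model exponent; cost = (rows * cols) ** alpha
    Returns (assignment, loads): rank index per matrix, and per-rank cost.
    """
    loads = [0.0] * num_ranks
    assignment = [None] * len(shapes)
    # Largest-first order (the LPT scheduling heuristic) tends to
    # even out per-rank totals when matrix sizes vary widely.
    order = sorted(range(len(shapes)),
                   key=lambda i: shapes[i][0] * shapes[i][1],
                   reverse=True)
    for i in order:
        cost = float(shapes[i][0] * shapes[i][1]) ** alpha
        rank = min(range(num_ranks), key=loads.__getitem__)
        loads[rank] += cost
        assignment[i] = rank
    return assignment, loads
```

Because each matrix stays whole, a matrix-based optimizer (Shampoo, Muon, SOAP) can compute its preconditioned update locally on the owning rank without first reconstructing shards, which is the conflict with tensor fragmentation that the abstract highlights.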