VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
August 4, 2025
Authors: Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu
cs.AI
Abstract
Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training.

We present VeOmni, a modular and efficient training framework that accelerates the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism for omni-modal LLMs. VeOmni also features a flexible configuration interface that supports seamless integration of new modalities with minimal code change.

Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained at over 2,800 tokens/sec/GPU and scaled to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.