VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

August 4, 2025
Authors: Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu
cs.AI

Abstract

Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, limiting scalability and incurring substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework that accelerates the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism for omni-modal LLMs. VeOmni also features a flexible configuration interface that supports seamless integration of new modalities with minimal code change. Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained at over 2,800 tokens/sec/GPU throughput and scaled to 160K-token context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.
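
The core idea behind the model-centric recipes is that the model is written as plain PyTorch, with no communication or sharding logic inside, and a separate recipe applies the distributed strategy afterwards. The sketch below illustrates this pattern only, using standard PyTorch distributed APIs; `OmniModel`, `apply_recipe`, and the mesh axis names are hypothetical stand-ins, not VeOmni's actual interface.

```python
# Minimal sketch of a model-centric parallel recipe (hypothetical names;
# not VeOmni's real API). The model is pure PyTorch: no parallelism inside.
# Running it requires a distributed launch (e.g. torchrun) on 128 GPUs.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class OmniModel(nn.Module):
    """Plain model definition: modality encoders plus a shared backbone."""

    def __init__(self, encoders: dict[str, nn.Module], backbone: nn.Module):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        self.backbone = backbone

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # Encode each modality independently, fuse along the sequence
        # dimension, then run the shared backbone.
        embeds = [self.encoders[name](x) for name, x in inputs.items()]
        return self.backbone(torch.cat(embeds, dim=1))


def apply_recipe(model: nn.Module) -> nn.Module:
    """Hypothetical recipe: the 3D-parallel layout (data x tensor x context,
    8 x 4 x 4 = 128 ranks) is chosen entirely outside the model definition.
    Only the data-parallel axis is realized here, via FSDP, as a stand-in
    for a full recipe that would also shard the tensor and context axes."""
    mesh = init_device_mesh("cuda", (8, 4, 4), mesh_dim_names=("dp", "tp", "cp"))
    return FSDP(model, device_mesh=mesh["dp"])
```

Because the recipe receives an ordinary `nn.Module`, the same model code can be reused under different parallel layouts, which is the decoupling the abstract describes.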
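
The abstract also mentions a configuration interface that lets a new modality be integrated with minimal code change. One common way to realize this is a registry pattern, sketched below; the decorator, registry, and config keys are hypothetical illustrations, not VeOmni's documented interface.

```python
# Hypothetical sketch of a modality registry: a new encoder is added by
# registering a class under a name and referencing that name in a config,
# with no change to the training loop. All names here are illustrative.
import torch.nn as nn

ENCODER_REGISTRY: dict[str, type[nn.Module]] = {}


def register_encoder(name: str):
    """Decorator that makes an encoder class discoverable by name."""
    def decorate(cls: type[nn.Module]) -> type[nn.Module]:
        ENCODER_REGISTRY[name] = cls
        return cls
    return decorate


@register_encoder("audio")
class AudioEncoder(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(128, dim)  # e.g. 128 mel bins -> model dim

    def forward(self, x):
        return self.proj(x)


def build_encoders(config: dict[str, dict]) -> nn.ModuleDict:
    """Instantiate every modality listed in the config by registry lookup."""
    return nn.ModuleDict(
        {name: ENCODER_REGISTRY[name](**kwargs) for name, kwargs in config.items()}
    )


# Adding a modality then amounts to one registered class plus a config entry:
encoders = build_encoders({"audio": {"dim": 1024}})
```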