Ming-Omni: A Unified Multimodal Model for Perception and Generation

Abstract

We propose Ming-Omni, a unified multimodal model that processes images, text, audio, and video and is also proficient at both speech and image generation. Ming-Omni uses dedicated encoders to extract tokens from each modality; these tokens are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design lets a single model efficiently process and fuse multimodal inputs within a unified framework, supporting diverse tasks without separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. It does so by integrating an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which together allow the model to engage in context-aware chat, perform text-to-speech conversion, and carry out versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-Omni is, to our knowledge, the first open-source model to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
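
The abstract names modality-specific routers but does not describe how they work. As a rough illustration only, the sketch below assumes a common MoE pattern: a shared pool of experts, one routing matrix per modality, and top-k softmax gating. The class name ModalityRouterMoE, the linear experts, and all parameters are hypothetical and are not taken from the Ming-Omni code.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ModalityRouterMoE:
    """Toy MoE layer (hypothetical): shared experts, one router per modality."""

    def __init__(self, dim, num_experts, modalities, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Assumption: each expert is a single dim x dim linear map,
        # standing in for a real feed-forward expert.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        # Assumption: each modality owns its own (num_experts x dim) routing matrix.
        self.routers = {m: rand_matrix(num_experts, dim) for m in modalities}

    def route(self, token, modality):
        """Return [(expert_index, gate_weight), ...] for the top-k experts."""
        logits = matvec(self.routers[modality], token)
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = top[: self.top_k]
        gates = softmax([logits[i] for i in top])
        return list(zip(top, gates))

    def forward(self, token, modality):
        """Gate-weighted sum of the selected experts' outputs."""
        out = [0.0] * len(token)
        for index, gate in self.route(token, modality):
            expert_out = matvec(self.experts[index], token)
            out = [o + gate * y for o, y in zip(out, expert_out)]
        return out


layer = ModalityRouterMoE(
    dim=4, num_experts=4, modalities=["text", "image", "audio", "video"]
)
token = [0.5, -1.0, 0.25, 2.0]
for modality in ["text", "image", "audio", "video"]:
    print(modality, layer.route(token, modality))
```

In this sketch the same token vector can be sent to different experts depending on the modality tag, because each modality consults its own routing matrix while the experts themselves are shared.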
