Ming-Omni: A Unified Multimodal Model for Perception and Generation

Abstract

We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE (Mixture of Experts) architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved by integrating an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation. These components also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-Omni is, to our knowledge, the first open-source model to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
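The abstract names modality-specific routers but does not describe how they operate. The following is a minimal sketch of one plausible reading, not the authors' implementation: a pool of experts is shared across modalities, while each modality owns its own router that scores the experts for a token. The class name `ModalityRoutedMoE`, the top-k gating, the single-matrix experts, and all dimensions are assumptions made for illustration only.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix standing in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ModalityRoutedMoE:
    """Toy MoE layer (hypothetical): shared experts, one router per modality."""

    def __init__(self, dim, num_experts, modalities, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Shared expert pool; each expert is a single linear map for brevity.
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]
        # Assumption: each modality gets its own (num_experts x dim) router.
        self.routers = {m: make_matrix(num_experts, dim, rng) for m in modalities}

    def route(self, token, modality):
        """Return [(expert_index, gate_weight), ...] for the top-k experts."""
        logits = matvec(self.routers[modality], token)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[: self.top_k]
        gates = softmax([logits[i] for i in top])
        return list(zip(top, gates))

    def forward(self, token, modality):
        """Gate-weighted sum of the selected experts' outputs."""
        out = [0.0] * len(token)
        for idx, gate in self.route(token, modality):
            expert_out = matvec(self.experts[idx], token)
            out = [o + gate * e for o, e in zip(out, expert_out)]
        return out


modalities = ["text", "image", "audio", "video"]
layer = ModalityRoutedMoE(dim=4, num_experts=4, modalities=modalities)
token = [0.5, -0.2, 0.1, 0.9]
for modality in modalities:
    # The same token vector can reach different experts depending on
    # which modality's router scores it.
    print(modality, layer.route(token, modality))
```

In this sketch the experts are shared, so only the small router matrices are duplicated per modality. Whether Ling shares experts in this way is not stated in the abstract.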
