Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O improves visual understanding and generation jointly. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models: it achieves 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1%, respectively, averaged across seven benchmarks. Taking only about 3 s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will facilitate future research on real-time unified multimodal intelligence that runs entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
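
To give a rough sense of why depthwise-separable convolutions keep a module such as the MCP cheap, the sketch below compares weight counts for a standard convolution and its depthwise-separable counterpart. It is only an illustration of the general technique: the channel widths and kernel size are hypothetical, biases are ignored, and nothing here reflects the actual MCP configuration, which the abstract does not specify.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical sizes chosen for illustration only (not taken from the paper).
c_in, c_out, k = 256, 256, 3

std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.1f}x")
```

For these assumed sizes the separable form needs roughly 8.7 times fewer weights; in general the reduction factor is 1 / (1/c_out + 1/k^2), so it approaches k^2 as the output width grows.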
