Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

February 23, 2026
Authors: Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan
cs.AI

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy to deploy on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained with a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances visual understanding and generation. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models: it achieves 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1%, respectively, averaged across seven benchmarks. Generating a 512x512 image in only ~3s on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will facilitate future research on real-time unified multimodal intelligence that runs entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
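
The abstract describes the MCP only at a high level: depthwise-separable convolutions plus layerwise alignment to condition a diffusion generator on vision-language features. Below is a minimal PyTorch sketch of what such a projector could look like; the class name `MobileConditioningProjector`, the one-projector-per-aligned-layer structure, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class MobileConditioningProjector(nn.Module):
    """Hypothetical MCP-style block (names and shapes assumed, not from the paper).

    Projects per-layer vision-language hidden states into the conditioning
    space of a diffusion generator, one lightweight projector per aligned
    layer, using depthwise-separable 1D convolutions over the token axis.
    """

    def __init__(self, vlm_dim: int, diff_dim: int, num_layers: int):
        super().__init__()
        self.projectors = nn.ModuleList([
            nn.Sequential(
                # Depthwise conv: one filter per channel, mixing information
                # along the token axis at cost O(tokens * channels).
                nn.Conv1d(vlm_dim, vlm_dim, kernel_size=3,
                          padding=1, groups=vlm_dim),
                # Pointwise conv: mixes channels and maps features to the
                # diffusion generator's hidden size.
                nn.Conv1d(vlm_dim, diff_dim, kernel_size=1),
            )
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(diff_dim)

    def forward(self, vlm_states: list[torch.Tensor]) -> list[torch.Tensor]:
        # vlm_states: per-layer features, each of shape (batch, tokens, vlm_dim).
        conds = []
        for proj, h in zip(self.projectors, vlm_states):
            c = proj(h.transpose(1, 2)).transpose(1, 2)  # (B, T, diff_dim)
            conds.append(self.norm(c))
        return conds  # one conditioning tensor per aligned diffusion layer


# Toy usage: four aligned layers, 77 prompt tokens.
mcp = MobileConditioningProjector(vlm_dim=768, diff_dim=512, num_layers=4)
feats = [torch.randn(1, 77, 768) for _ in range(4)]
conds = mcp(feats)  # four tensors of shape (1, 77, 512)
```

The depthwise-plus-pointwise factorization is the standard trick for cheap convolution on mobile hardware: it replaces a dense convolution's roughly O(C_in * C_out * k) per-token cost with O(C_in * k + C_in * C_out), which is consistent with the abstract's claim of cross-modal conditioning at minimal computational cost.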