Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

February 23, 2026
Authors: Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan
cs.AI

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy to deploy on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained on a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances visual understanding and generation. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% on average across seven benchmarks. Generating a 512x512 image in only ~3s on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will facilitate future research on real-time unified multimodal intelligence that runs entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
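The abstract describes the MCP only at a high level. Below is a minimal PyTorch sketch of what a depthwise-separable conditioning projector with layerwise alignment could look like; the module names, dimensions, and per-layer head layout are assumptions for illustration, not the paper's released architecture.

```python
# Hypothetical sketch of an MCP-style module: the paper states only that MCP
# fuses vision-language features into a diffusion generator via
# depthwise-separable convolutions and layerwise alignment. Everything else
# here (shapes, layer counts, names) is assumed.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv followed by a pointwise (1x1) conv."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class MobileConditioningProjector(nn.Module):
    """Maps VLM features to per-layer conditioning for a diffusion generator."""

    def __init__(self, vlm_dim: int, cond_dim: int, num_diffusion_layers: int):
        super().__init__()
        self.shared = DepthwiseSeparableConv(vlm_dim, cond_dim)
        # One lightweight head per diffusion layer ("layerwise alignment").
        self.per_layer = nn.ModuleList(
            DepthwiseSeparableConv(cond_dim, cond_dim)
            for _ in range(num_diffusion_layers)
        )

    def forward(self, vlm_feats: torch.Tensor) -> list[torch.Tensor]:
        # vlm_feats: (B, vlm_dim, H, W) -- VLM tokens reshaped onto a 2D grid.
        shared = self.shared(vlm_feats)
        return [head(shared) for head in self.per_layer]
```

Similarly, a hedged illustration of one record in the quadruplet post-training format (generation prompt, image, question, answer). The field names and the loss pairing in the comments are assumptions, not the paper's released schema:

```python
# One hypothetical quadruplet sample used for joint post-training.
from dataclasses import dataclass


@dataclass
class Quadruplet:
    gen_prompt: str   # text-to-image prompt paired with the target image
    image_path: str   # target image for generation / input image for QA
    question: str     # understanding-side question about the image
    answer: str       # ground-truth answer supervising the VLM branch


sample = Quadruplet(
    gen_prompt="a red bicycle leaning against a brick wall",
    image_path="data/bicycle_0001.png",
    question="What color is the bicycle?",
    answer="Red",
)
# Presumably, (gen_prompt, image) supervises the diffusion branch while
# (image, question, answer) supervises the understanding branch, so each
# sample improves generation and understanding together.
```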