Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy to deploy on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment, which enables efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly improves visual understanding and generation. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models: it achieves 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster, respectively. For visual understanding, it surpasses them by 15.3% and 5.1%, respectively, averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research on real-time unified multimodal intelligence that runs entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/.
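
To give a sense of why depthwise-separable convolutions keep a projector such as the MCP cheap, the sketch below compares weight counts for a standard convolution and its depthwise-separable counterpart. It is a back-of-the-envelope illustration only: the function names and the channel and kernel sizes are hypothetical and are not taken from Mobile-O.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical sizes chosen for illustration only; not taken from Mobile-O.
C_IN, C_OUT, K = 256, 256, 3
standard = standard_conv_params(C_IN, C_OUT, K)
separable = depthwise_separable_params(C_IN, C_OUT, K)

print(f"standard:  {standard:,} weights")    # 589,824
print(f"separable: {separable:,} weights")   # 67,840
print(f"reduction: {standard / separable:.1f}x")  # 8.7x
```

Under these assumed sizes the separable form needs roughly 8.7 times fewer weights. The actual savings in Mobile-O depend on the MCP's real layer configuration, which the abstract does not specify.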
