Lance: Unified Multimodal Modeling by Multi-Task Synergy

May 18, 2026
Authors: Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang
cs.AI

Abstract

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.
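
The abstract does not say how the modality-aware rotary positional encoding is built. As a rough illustration only, the sketch below applies standard rotary positional encoding and shifts each token's position by a per-modality offset, so that text, image, and video tokens occupy separate position ranges. The offset table, the function names, and the offsetting scheme itself are assumptions made for this example and are not taken from the paper.

```python
import math

# Hypothetical per-modality position offsets (an assumption for illustration,
# not a detail from the paper).
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary positional encoding to an even-length vector."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair index is i / 2, so the frequency base^(-2 * (i / 2) / d)
        # simplifies to base^(-i / d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def modality_aware_rope(vec, index, modality):
    """Rotate `vec` by its in-modality index plus an assumed modality offset."""
    return rope_rotate(vec, MODALITY_OFFSET[modality] + index)


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

Under these assumptions, the score between two tokens of the same modality depends only on their relative index, as in ordinary rotary encoding, while tokens of different modalities are pushed apart by the offset. Whether Lance uses offsets, separate frequency bands, or another mechanism cannot be determined from the abstract.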