Lance: Unified Multimodal Modeling via Multi-Task Synergy

Abstract

We present Lance, a lightweight, natively unified model that supports multimodal understanding, generation, and editing for both images and videos. Rather than relying on scaling model capacity or on text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling through collaborative multi-task training. It rests on two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and applies a dual-stream mixture-of-experts architecture to shared interleaved multimodal sequences, enabling joint context learning while decoupling the understanding and generation pathways. We further introduce a modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and to improve cross-task alignment. During training, Lance adopts a staged multi-task paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic understanding and visual generation. Experimental results show that Lance substantially outperforms existing open-source unified models in image and video generation while retaining strong multimodal understanding capabilities. The project homepage is available at https://lance-project.github.io.
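
To make the two principles concrete, the following is a minimal illustrative sketch, not Lance's actual implementation. It assumes hard routing by a per-token stream tag, a single-head attention step without learned projections, and small ReLU feed-forward experts; none of these details is specified in the abstract. The names `shared_attention`, `make_expert`, and `dual_stream_block` are hypothetical. The sketch shows unified context modeling (every token attends over the whole interleaved sequence) combined with decoupled capability pathways (each token is processed by the expert of its own stream).

```python
import math
import random


def linear(x, w):
    # Multiply a vector x (length d_in) by a weight matrix w (d_in x d_out).
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(tokens):
    # Unified context (assumed form): every token attends to every token in
    # the interleaved sequence, regardless of which stream it belongs to.
    d = len(tokens[0])
    mixed = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        mixed.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return mixed


def make_expert(d, hidden, rng):
    # Hypothetical expert: a two-layer ReLU feed-forward network.
    w1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]

    def ffn(x):
        h = [max(0.0, v) for v in linear(x, w1)]
        return linear(h, w2)

    return ffn


def dual_stream_block(tokens, streams, experts):
    # Decoupled pathways (assumed form): after the shared attention step,
    # each token goes through the expert named by its stream tag.
    mixed = shared_attention(tokens)
    out = []
    for x, m, s in zip(tokens, mixed, streams):
        h = [a + b for a, b in zip(x, m)]           # residual around attention
        f = experts[s](h)                           # stream-specific expert
        out.append([a + b for a, b in zip(h, f)])   # residual around expert
    return out
```

In this sketch, replacing the generation expert leaves the outputs at understanding-tagged positions unchanged within a single block, while every position still depends on the full interleaved input through the shared attention step.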