Lance: Unified Multimodal Modeling via Multi-Task Synergy

Abstract

We present Lance, a lightweight, native unified model that supports multimodal understanding, generation, and editing for both images and videos. Rather than relying on scaling model capacity or on text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling through collaborative multi-task training. It rests on two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts (MoE) architecture over shared interleaved multimodal sequences, which enables joint context learning while keeping the understanding and generation pathways decoupled. We further introduce modality-aware rotary positional encoding (RoPE) to mitigate interference among heterogeneous visual tokens and improve cross-task alignment. During training, Lance adopts a staged multi-task paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation. Experimental results show that Lance substantially outperforms existing open-source unified models in image and video generation while retaining strong multimodal understanding capabilities. The project homepage is available at https://lance-project.github.io.
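
The abstract names modality-aware rotary positional encoding but does not specify how it is constructed. As a rough illustration only, the sketch below applies standard RoPE and adds a per-modality position offset so that text, image, and video tokens occupy disjoint position ranges. The offset table `MODALITY_OFFSET`, the function names, and the offset scheme itself are assumptions made for this example and should not be read as Lance's actual formulation.

```python
import math

# Hypothetical per-modality position offsets. This is an assumption for
# illustration; the abstract does not say how Lance makes RoPE modality-aware.
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position encoding to an even-length vector.

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by the angle
    position * base ** (-i / dim), where i is the even element index.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def modality_aware_rope(vec, position, modality):
    """Shift the position by a modality-specific offset, then apply RoPE."""
    return rope_rotate(vec, position + MODALITY_OFFSET[modality])


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

Under this assumed scheme, the attention score between two tokens of the same modality still depends only on their relative distance, as in ordinary RoPE, while tokens of different modalities at the same nominal index receive different rotations.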