Lance: Unified Multimodal Modeling via Multi-Task Synergy

Abstract

We present Lance, a lightweight, natively unified model that supports multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling through collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture over shared interleaved multimodal sequences, which enables joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and to promote cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling, strengthening both semantic comprehension and visual generation. Experimental results show that Lance substantially outperforms existing open-source unified models in image and video generation while retaining strong multimodal understanding capabilities. The project homepage is available at https://lance-project.github.io.
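
To make the idea of modality-aware rotary positional encoding more concrete, the following is a minimal sketch of one possible reading, not the formulation used in Lance. It assumes that each modality receives a fixed position offset before a standard rotary rotation is applied, so that tokens from different modalities occupy separate position ranges. The names `MODALITY_OFFSET`, `rope_rotate`, and `modality_aware_rope`, as well as the offset values, are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-modality position offsets (an assumption for illustration,
# not values taken from the paper).
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary rotation to an even-length vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Frequency for this coordinate pair: base^(-i / dim), i = 0, 2, 4, ...
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def modality_aware_rope(vec, local_index, modality):
    """Shift the position by a modality-specific offset, then rotate."""
    return rope_rotate(vec, MODALITY_OFFSET[modality] + local_index)


def dot(a, b):
    """Plain dot product, used here as an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 0.0, 0.5, -0.5]
    k = [0.3, 0.7, -0.2, 0.1]
    same = dot(modality_aware_rope(q, 3, "image"), modality_aware_rope(k, 5, "image"))
    cross = dot(modality_aware_rope(q, 3, "text"), modality_aware_rope(k, 5, "image"))
    print("image-image score:", same)
    print("text-image score:", cross)
```

Under this assumption, scores between tokens of the same modality depend only on their relative position, as in ordinary rotary encoding, while tokens from different modalities are separated by a large effective distance. Whether Lance uses offsets, separate frequency bands, or some other mechanism is not stated in the abstract.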