LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Abstract

Developing unified video generation and editing models that can interpret interleaved multimodal inputs is a promising yet challenging research frontier. Existing unified frameworks predominantly rely on very large models (typically 13B parameters or more) and perform editing by concatenating the source video tokens with the target sequence. This concatenation doubles the sequence length, and because self-attention cost grows quadratically with sequence length, it quadruples the attention computation and introduces prohibitive overhead. To address these bottlenecks, we present LoomVideo, an efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing: the clean source video latent is scaled and added directly to the noised target latent. This design eliminates the need for token concatenation, substantially reducing computational cost while retaining robust capability for complex, non-rigid edits. In addition, a Negative Temporal RoPE strategy is integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particularly strong results in e-commerce and fashion generation scenarios. Owing to the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x inference speedup over models of similar capability, paving the way for practical and efficient video foundation models.
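As a rough illustration of the cost argument in the abstract, the Python sketch below contrasts token concatenation with a Scale-and-Add style of conditioning on toy latents. It is not the authors' implementation: the linear noising rule, the scale value `alpha`, the toy latent sizes, and every function name are assumptions made for this example only, since the abstract does not specify them. The sketch shows only that adding the source latent keeps the token count unchanged, whereas concatenation doubles it and therefore quadruples the number of pairwise attention scores.

```python
import random


def random_latent(rng, num_tokens, dim):
    """Toy stand-in for a video latent: num_tokens vectors of size dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def noise_target(target, noise, t):
    """Blend the target latent with noise.

    Assumption: a simple linear interpolation; the abstract does not
    state which noising schedule is used.
    """
    return [[(1.0 - t) * x + t * n for x, n in zip(tok, noise_tok)]
            for tok, noise_tok in zip(target, noise)]


def scale_and_add(noised_target, clean_source, alpha):
    """Add the scaled clean source latent to the noised target latent.

    `alpha` is a hypothetical scale factor. The sequence length is unchanged.
    """
    return [[x + alpha * s for x, s in zip(tok, src_tok)]
            for tok, src_tok in zip(noised_target, clean_source)]


def concat_condition(noised_target, clean_source):
    """Baseline: append source tokens, which doubles the sequence length."""
    return noised_target + clean_source


def attention_pairs(num_tokens):
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


rng = random.Random(0)
NUM_TOKENS, DIM = 8, 4
source = random_latent(rng, NUM_TOKENS, DIM)
target = random_latent(rng, NUM_TOKENS, DIM)
noise = random_latent(rng, NUM_TOKENS, DIM)

noised = noise_target(target, noise, t=0.5)
added = scale_and_add(noised, source, alpha=0.5)
concatenated = concat_condition(noised, source)

ratio = attention_pairs(len(concatenated)) / attention_pairs(len(added))
print(len(added), len(concatenated), ratio)  # 8 16 4.0
```

The factor of 4 here counts attention score pairs only. The end-to-end speedup reported in the abstract depends on the full model and hardware and cannot be derived from this toy calculation.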