DreamStyle: A Unified Framework for Video Stylization
January 6, 2026
Authors: Mengtian Li, Jinshu Chen, Songtao Zhao, Wanquan Feng, Pengqi Tu, Qian He
cs.AI
Abstract
Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, a style image, and a stylized first frame. Each condition has a characteristic advantage: text is more flexible, a style image provides a more precise visual anchor, and a stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, the lack of high-quality datasets leads to style inconsistency and temporal flickering. To address these limitations, we introduce DreamStyle, a unified framework for video stylization supporting (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a carefully designed data curation pipeline that acquires high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained using Low-Rank Adaptation (LoRA) with token-specific up matrices, which reduces confusion among the different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle handles all three video stylization tasks and outperforms competing methods in style consistency and video quality.
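The mechanism named in the abstract, LoRA with token-specific up matrices, can be illustrated with a minimal sketch: a shared low-rank down-projection is paired with a separate up-projection per condition type, so the adapter update for one condition token (e.g. a style-image token) does not interfere with another (e.g. a text token). All names, shapes, and the set of condition types below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))        # frozen base weight of the I2V model layer
A = rng.normal(size=(r, d)) * 0.1  # shared LoRA down-projection
B = {                              # one up matrix per condition-token type (assumed set)
    "text": np.zeros((d, r)),
    "style_image": np.zeros((d, r)),
    "first_frame": np.zeros((d, r)),
    "video": np.zeros((d, r)),
}

def lora_forward(x, token_types):
    """x: (seq_len, d) token activations; token_types: condition label per token."""
    out = x @ W.T                  # frozen base path
    low = x @ A.T                  # shared low-rank projection
    for i, t in enumerate(token_types):
        out[i] += low[i] @ B[t].T  # token-specific up-projection per condition type
    return out

x = rng.normal(size=(3, d))
y = lora_forward(x, ["text", "style_image", "video"])
# With the up matrices zero-initialized (standard LoRA practice),
# the adapted layer initially reproduces the frozen base output.
assert np.allclose(y, x @ W.T)
```

Routing only the up-projection per token type keeps the parameter overhead low (one shared A, several small B matrices) while giving each condition its own subspace of the update, which is one plausible reading of how the confusion among condition tokens is reduced.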