PickStyle: Video-to-Video Style Transfer with Context-Style Adapters

October 8, 2025
Authors: Soroush Mehraban, Vida Adeli, Jacob Rommann, Babak Taati, Kyryl Truskovskyi
cs.AI

Abstract

We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
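The abstract does not include implementation details, but the adapter design it describes (low-rank adapters inserted into the self-attention layers of the conditioning modules, with the pretrained backbone kept frozen) follows a standard LoRA-style pattern. Below is a minimal sketch under that assumption; the module attribute names `to_q`, `to_k`, `to_v`, and `to_out` are hypothetical, not the paper's API.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update: W x + scale * B(A x)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                               # keep the pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)  # A: d_in -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)   # B: r -> d_out
        nn.init.zeros_(self.up.weight)                             # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def add_style_adapters(attn_block: nn.Module, rank: int = 16):
    """Wrap the projection layers of a self-attention block with low-rank adapters.
    The attribute names below are illustrative assumptions about the block's layout."""
    for name in ("to_q", "to_k", "to_v", "to_out"):
        proj = getattr(attn_block, name, None)
        if isinstance(proj, nn.Linear):
            setattr(attn_block, name, LoRALinear(proj, rank=rank))
```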
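The synthetic-clip construction is only described at a high level: paired stills are turned into short clips by applying shared augmentations that mimic camera motion, so the source and styled frames stay aligned. One plausible realization is to slide an identical crop window across both images; the linear pan, frame count, and crop fraction below are illustrative assumptions, not the paper's schedule.

```python
import torch
import torchvision.transforms.functional as TF

def make_paired_clip(src_img, style_img, num_frames=16, crop_frac=0.8, max_shift=0.15):
    """Simulate camera motion by moving one shared crop window over both stills,
    yielding temporally aligned (source, styled) clips of shape (T, C, H, W)."""
    _, h, w = src_img.shape
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    start = torch.tensor([0.0, 0.0])                       # crop offset at the first frame
    end = torch.tensor([max_shift * h, max_shift * w])     # crop offset at the last frame
    src_frames, style_frames = [], []
    for t in torch.linspace(0, 1, num_frames):
        top, left = (start + t * (end - start)).long().tolist()
        src_frames.append(TF.resized_crop(src_img, top, left, ch, cw, [h, w]))
        style_frames.append(TF.resized_crop(style_img, top, left, ch, cw, [h, w]))
    return torch.stack(src_frames), torch.stack(style_frames)
```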
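For CS-CFG, the abstract states only that classifier-free guidance is factorized into independent text (style) and video (context) directions. A common way to compose guidance over two conditions is sketched below in the spirit of multi-condition CFG; the specific split, the null-condition handling, and the guidance weights are assumptions, and the paper's exact factorization may differ.

```python
def cs_cfg_noise(eps, x_t, text_cond, video_cond, w_style=7.5, w_ctx=1.5):
    """Two-direction classifier-free guidance sketch.
    `eps(x_t, text, video)` is the model's noise prediction; None denotes a null condition."""
    eps_uncond = eps(x_t, None, None)            # neither style nor context
    eps_ctx = eps(x_t, None, video_cond)         # context (input video) only
    eps_full = eps(x_t, text_cond, video_cond)   # context plus text-specified style
    # The context direction pulls the sample toward the input video; the style
    # direction adds the text-specified style on top of the preserved context.
    return (eps_uncond
            + w_ctx * (eps_ctx - eps_uncond)
            + w_style * (eps_full - eps_ctx))
```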