PickStyle: Video-to-Video Style Transfer with Context-Style Adapters

Abstract

We address video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and is trained on paired still-image data with source-style correspondences. PickStyle inserts low-rank adapters into the self-attention layers of the conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, so that temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in the generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
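The idea of building synthetic training clips from a paired image can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "camera motion" is approximated by a simple linear pan of a crop window, and that "shared augmentation" means applying the identical window sequence to both the source image and its stylized counterpart. The function names `pan_windows` and `make_paired_clips`, and the parameters `crop_frac` and `num_frames`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def pan_windows(height, width, num_frames, crop_frac=0.8, seed=0):
    """Return one (top, left, crop_h, crop_w) window per frame along a linear pan.

    Assumption: camera motion is modeled as a straight-line pan of a
    fixed-size crop window between two random positions.
    """
    rng = np.random.default_rng(seed)
    crop_h, crop_w = int(height * crop_frac), int(width * crop_frac)
    # Margin within which the crop window is free to move.
    free_h, free_w = height - crop_h, width - crop_w
    start = rng.uniform(0.0, 1.0, size=2)
    end = rng.uniform(0.0, 1.0, size=2)
    windows = []
    for t in np.linspace(0.0, 1.0, num_frames):
        pos = (1.0 - t) * start + t * end
        top = int(round(pos[0] * free_h))
        left = int(round(pos[1] * free_w))
        windows.append((top, left, crop_h, crop_w))
    return windows


def make_paired_clips(source, styled, num_frames=8, seed=0):
    """Turn one (source, styled) image pair into two frame-aligned clips.

    The same window sequence is applied to both images, so frame i of the
    source clip corresponds pixel-for-pixel to frame i of the styled clip.
    """
    assert source.shape == styled.shape, "paired images must have the same shape"
    h, w = source.shape[:2]
    windows = pan_windows(h, w, num_frames, seed=seed)
    src_clip = np.stack([source[t:t + ch, l:l + cw] for t, l, ch, cw in windows])
    sty_clip = np.stack([styled[t:t + ch, l:l + cw] for t, l, ch, cw in windows])
    return src_clip, sty_clip
```

Called on two `(H, W, 3)` arrays, `make_paired_clips` returns two arrays of shape `(num_frames, crop_h, crop_w, 3)`. Because the crop windows are shared, the pixel-level correspondence between content and style is kept in every frame, which is the property the paired supervision relies on. A fuller version might add zoom or rotation under the same shared-parameter principle.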
