CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models

Abstract

Content-style decomposition (CSD), the task of disentangling content and style from a single image, enables recontextualization of the extracted content and stylization with the extracted style, offering greater creative flexibility in visual synthesis. Recent personalization methods have explored explicit content-style decomposition, but they remain tailored to diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative built on a next-scale prediction paradigm, achieving performance comparable to that of diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representations with their respective scales to enhance separation, (2) an SVD-based rectification method that mitigates content leakage into style representations, and (3) an augmented key-value (K-V) memory that strengthens content identity preservation. To benchmark this task, we introduce CSD-100, a dataset designed specifically for content-style decomposition that features diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
