

CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models

July 18, 2025
Authors: Quang-Binh Nguyen, Minh Luu, Quang Nguyen, Anh Tran, Khoi Nguyen
cs.AI

Abstract

Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored explicit content-style decomposition, they remain tailored to diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance comparable to that of diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representations with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) an augmented Key-Value (K-V) memory that strengthens content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
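The abstract does not detail how the SVD-based rectification works. One plausible minimal sketch, under the assumption (not stated in the abstract) that it removes the component of a learned style embedding lying in the subspace spanned by the content embedding's leading singular directions, could look like this; the function name, shapes, and rank-`k` choice are all illustrative:

```python
import numpy as np

def svd_rectify(style_emb, content_emb, k=1):
    """Hypothetical sketch: project the style embedding off the top-k
    content singular directions to reduce content leakage.

    style_emb:   (m, d) learned style token embeddings
    content_emb: (n, d) learned content token embeddings
    """
    # Right singular vectors of the content embedding span the
    # content-dominant directions in embedding space.
    _, _, Vt = np.linalg.svd(content_emb, full_matrices=False)
    basis = Vt[:k]                       # (k, d) leading directions
    # Component of the style embedding inside the content subspace.
    leakage = style_emb @ basis.T @ basis
    return style_emb - leakage
```

After rectification, the returned style embedding is orthogonal to the removed content directions, so stylization is less likely to reproduce the subject itself. This is only one reading of "SVD-based rectification"; the paper's actual formulation may differ.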
PDF, July 21, 2025