

CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models

July 18, 2025
Authors: Quang-Binh Nguyen, Minh Luu, Quang Nguyen, Anh Tran, Khoi Nguyen
cs.AI

Abstract

Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of the extracted content and stylization with the extracted style, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored explicit content-style decomposition, they remain tailored to diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance comparable to that of diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representations with their respective scales to enhance separation, (2) an SVD-based rectification method that mitigates content leakage into style representations, and (3) an Augmented Key-Value (K-V) memory that strengthens content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
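The SVD-based rectification mentioned in the abstract can be pictured as projecting learned style embeddings away from the dominant singular directions of the content embeddings, so that content information does not leak into the style representation. The sketch below is only a minimal illustration of that idea under assumed names and shapes (rectify_style, style_emb, content_emb, rank k); it is not the authors' implementation.

```python
# Illustrative sketch of one possible "SVD-based rectification":
# suppress the top-k content directions inside the style embedding.
# Tensor names, shapes, and the rank k are assumptions for illustration.
import torch

def rectify_style(style_emb: torch.Tensor,
                  content_emb: torch.Tensor,
                  k: int = 4) -> torch.Tensor:
    """Project style embeddings onto the orthogonal complement of the
    top-k singular directions of the content embeddings.

    style_emb:   (n_style_tokens, d) learned style token embeddings
    content_emb: (n_content_tokens, d) learned content token embeddings
    k:           number of dominant content directions to remove
    """
    # Right singular vectors of the content embeddings span their
    # dominant directions in the shared embedding space.
    _, _, vh = torch.linalg.svd(content_emb, full_matrices=False)
    v_k = vh[:k]                       # (k, d) top-k content directions
    # Subtract the component of each style vector lying in that subspace:
    # S <- S - S V_k^T V_k
    leakage = style_emb @ v_k.T @ v_k
    return style_emb - leakage

# Example usage with random placeholder embeddings
style = torch.randn(8, 768)
content = torch.randn(16, 768)
style_rectified = rectify_style(style, content, k=4)
```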