VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
April 10, 2025
Authors: Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng
cs.AI
Abstract
Recent progress in diffusion models significantly advances various image
generation tasks. However, the current mainstream approach remains focused on
building task-specific models, which have limited efficiency when supporting a
wide range of different needs. While universal models attempt to address this
limitation, they face critical challenges, including generalizable task
instruction, appropriate task distributions, and unified architectural design.
To tackle these challenges, we propose VisualCloze, a universal image
generation framework, which supports a wide range of in-domain tasks,
generalization to unseen ones, seamless unification of multiple tasks, and
reverse generation. Unlike existing methods that rely on language-based task
instruction, leading to task ambiguity and weak generalization, we integrate
visual in-context learning, allowing models to identify tasks from visual
demonstrations. Meanwhile, the inherent sparsity of visual task distributions
hampers the learning of transferable knowledge across tasks. To this end, we
introduce Graph200K, a graph-structured dataset that establishes various
interrelated tasks, enhancing task density and transferable knowledge.
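The graph-structured idea can be illustrated with a minimal sketch: if each image sample carries several interlinked modalities (nodes), then every ordered pair of modalities induces a condition-to-target task (edge), which is how a small set of annotations yields a dense web of interrelated tasks. The modality names and the helper below are illustrative assumptions, not the Graph200K API.

```python
import itertools

def enumerate_tasks(modalities):
    """List every condition -> target task induced by one set of modalities.

    With k modalities, the induced task graph has k * (k - 1) directed
    edges, so adding one modality densifies the task distribution.
    """
    return [(src, dst) for src, dst in itertools.permutations(modalities, 2)]

# Hypothetical modalities attached to a single image sample.
sample_modalities = ["rgb", "depth", "canny_edge", "segmentation"]
tasks = enumerate_tasks(sample_modalities)
# 4 modalities induce 12 directed tasks, e.g. ("depth", "rgb") is
# depth-to-image generation while ("rgb", "depth") is depth estimation.
```

Because many tasks share endpoints (every task above touches at most two of the same four modalities), knowledge learned on one edge can transfer to its neighbors, which is the density argument the abstract makes.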
Furthermore, we uncover that our unified image generation formulation shares a
consistent objective with image infilling, enabling us to leverage the strong
generative priors of pre-trained infilling models without modifying the
architectures.
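The infilling connection can be made concrete with a small sketch: in-context demonstrations and the query are composed into one grid image whose missing cell (the query's target) is exactly the region an infilling model is asked to complete. The layout, tile size, and function below are illustrative assumptions about the formulation, not the paper's implementation.

```python
import numpy as np

def build_incontext_grid(examples, query_condition, tile=64):
    """Arrange in-context demonstrations and a query into one grid image.

    examples: list of (condition, target) pairs, each a (tile, tile, 3)
    uint8 array. query_condition: condition image for the instance to solve.
    Returns the grid and a boolean mask over the missing target cell, so
    that generation reduces to infilling the masked region.
    """
    def fit(img):
        # Assume inputs already share the tile size; a real pipeline
        # would resize or pad here.
        assert img.shape == (tile, tile, 3)
        return img

    rows = [np.concatenate([fit(c), fit(t)], axis=1) for c, t in examples]
    placeholder = np.zeros((tile, tile, 3), dtype=np.uint8)  # region to generate
    rows.append(np.concatenate([fit(query_condition), placeholder], axis=1))
    grid = np.concatenate(rows, axis=0)

    mask = np.zeros(grid.shape[:2], dtype=bool)
    mask[-tile:, tile:] = True  # only the query's target cell is infilled
    return grid, mask
```

Under this framing, a pre-trained infilling model already has the right objective: condition on everything visible in the grid (including the demonstrations that define the task) and complete the masked cell, which is why no architectural change is needed.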