Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
November 7, 2024
Authors: Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-Xiong Wang
cs.AI
Abstract
Beyond high-fidelity image synthesis, diffusion models have recently
exhibited promising results in dense visual perception tasks. However, most
existing work treats diffusion models as a standalone component for perception
tasks, employing them either solely for off-the-shelf data augmentation or as
mere feature extractors. In contrast to these isolated and thus sub-optimal
efforts, we introduce a unified, versatile, diffusion-based framework,
Diff-2-in-1, that can simultaneously handle both multi-modal data generation
and dense visual perception, through a unique exploitation of the
diffusion-denoising process. Within this framework, we further enhance
discriminative visual perception via multi-modal generation, by utilizing the
denoising network to create multi-modal data that mirror the distribution of
the original training set. Importantly, Diff-2-in-1 optimizes the utilization
of the created diverse and faithful data by leveraging a novel self-improving
learning mechanism. Comprehensive experimental evaluations validate the
effectiveness of our framework, showcasing consistent performance improvements
across various discriminative backbones and high-quality multi-modal data
generation characterized by both realism and usefulness.
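The abstract describes two coupled uses of a single denoising network: generating multi-modal (image, label) pairs that mirror the training distribution, and serving as the backbone for dense prediction, with the generated data folded back into training via a self-improving loop. The sketch below is a minimal, hypothetical PyTorch illustration of that general idea, not the authors' implementation: the `denoiser` and `head` modules, the simplified sampler update, and the channel-packing of the two modalities are all assumptions made purely for exposition.

```python
# Illustrative sketch only -- NOT the Diff-2-in-1 code. It assumes a pretrained
# denoising network `denoiser(x, t)` that predicts a noise residual, and a
# lightweight dense-prediction `head`, to show one network doing both
# multi-modal generation and perception with a self-improving training step.
import torch
import torch.nn.functional as F


def generate_multimodal_batch(denoiser, noise, num_steps=50):
    """Hypothetical sampler: iteratively denoise Gaussian noise into an
    (image, label-map) pair assumed to lie near the training distribution."""
    x = noise
    for t in reversed(range(num_steps)):
        timesteps = torch.full((x.shape[0],), t, device=x.device)
        eps = denoiser(x, timesteps)        # assumed noise-residual prediction
        x = x - eps / num_steps             # simplified update, not exact DDPM/DDIM
    image, label = x.chunk(2, dim=1)        # assume channels pack both modalities
    return image, label


def self_improving_step(denoiser, head, real_images, real_labels, optimizer):
    """One hedged 'self-improving' iteration: train the perception head on real
    data plus fresh pairs generated by the same denoising network."""
    noise = torch.randn_like(torch.cat([real_images, real_labels], dim=1))
    with torch.no_grad():
        syn_images, syn_labels = generate_multimodal_batch(denoiser, noise)

    images = torch.cat([real_images, syn_images])
    labels = torch.cat([real_labels, syn_labels])

    # Reuse the denoiser's output at a fixed timestep as features for dense
    # prediction (an assumption standing in for the paper's feature extraction).
    t0 = torch.zeros(images.shape[0], device=images.device)
    preds = head(denoiser(images, t0))
    loss = F.mse_loss(preds, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```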