DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
February 24, 2025
Authors: Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
cs.AI
Abstract
Our primary goal here is to create a good, generalist perception model that
can tackle multiple tasks within the limits of computational resources and
training data. To achieve this, we resort to text-to-image diffusion models
pre-trained on billions of images. Our exhaustive evaluations demonstrate
that DICEPTION effectively tackles multiple perception tasks, achieving
performance on par with state-of-the-art models. We achieve results on par
with SAM-vit-h using only 0.06% of its training data (600K vs. 1B
pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates
the outputs of various perception tasks using color encoding; we show that
the strategy of assigning random colors to different instances is highly
effective in both entity segmentation and semantic segmentation. Unifying
various perception tasks as conditional image generation enables us to fully
leverage pre-trained text-to-image models. Thus, DICEPTION can be trained
efficiently, at a cost orders of magnitude lower than that of conventional
models trained from scratch. Adapting our model to other tasks requires
fine-tuning on as few as 50 images and updating only 1% of its parameters
(see the illustrative sketches below). DICEPTION
provides valuable insights and a more promising solution for visual generalist
models.
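
The random-color encoding mentioned in the abstract can be made concrete with a short sketch. The function below is a hypothetical illustration, not the authors' code: it renders a stack of binary instance masks as a single RGB target image by sampling one random color per instance, the kind of target a conditional image generator could be trained to produce.

```python
import numpy as np

def color_encode(masks, seed=None):
    """Render (N, H, W) boolean instance masks as an (H, W, 3) RGB target,
    assigning each instance a random color (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n, h, w = masks.shape
    target = np.zeros((h, w, 3), dtype=np.uint8)
    colors = rng.integers(0, 256, size=(n, 3), dtype=np.uint8)
    for mask, color in zip(masks, colors):
        target[mask] = color  # later instances overwrite earlier ones on overlap
    return target

# Toy example: two instances on a 4x4 canvas.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2] = True   # top half is instance 0
masks[1, 2:] = True   # bottom half is instance 1
print(color_encode(masks, seed=0).shape)  # (4, 4, 3)
```

Because the colors carry no fixed class meaning, the same encoding serves both entity segmentation (arbitrary instances) and semantic segmentation (one color per category), which is what makes the random-assignment strategy attractive for a unified model.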
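The claim of adapting with roughly 1% of the parameters suggests a parameter-efficient fine-tuning setup. The PyTorch sketch below is an assumption, not the paper's actual recipe: the toy model, the `adapter` module name, and the optimizer settings are invented for the example. It shows the general pattern of freezing a pretrained backbone and training only a small trainable module.

```python
import torch
from torch import nn

class ToyModel(nn.Module):
    """Stand-in for a pretrained backbone plus a small adapter; in a real
    setting the adapter would hold roughly 1% of the parameters."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(512, 512)  # pretend this is pretrained
        self.adapter = nn.Linear(512, 512)   # small trainable module

    def forward(self, x):
        return self.adapter(self.backbone(x))

def freeze_all_but(model, trainable_keywords=("adapter",)):
    """Freeze every parameter whose name contains no trainable keyword."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable}/{total} parameters ({trainable/total:.1%})")
    return [p for p in model.parameters() if p.requires_grad]

model = ToyModel()
params = freeze_all_but(model)
optimizer = torch.optim.AdamW(params, lr=1e-4)  # updates only the adapter
```

With only the adapter receiving gradients, a handful of annotated images (the abstract cites as few as 50) can suffice to specialize the model without disturbing the pretrained weights.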