DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
February 24, 2025
Authors: Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
cs.AI
Abstract
Our primary goal here is to create a good, generalist perception model that
can tackle multiple tasks within the limits of computational resources and
training data. To achieve this, we resort to text-to-image diffusion models
pre-trained on billions of images. Our exhaustive evaluations demonstrate
that DICEPTION effectively tackles multiple perception tasks, achieving
performance on par with state-of-the-art models. We achieve results on par
with SAM-vit-h using only 0.06% of its training data (600K vs. 1B
pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates
the outputs of various perception tasks using color encoding; we show that
the strategy of assigning random colors to different instances is highly
effective in both entity segmentation and semantic segmentation. Unifying
various perception tasks as conditional image generation enables us to fully
leverage pre-trained text-to-image models. Thus, DICEPTION can be trained
efficiently, at a cost orders of magnitude lower than that of conventional
models trained from scratch. Adapting our model to other tasks requires
fine-tuning on as few as 50 images and updating only 1% of its parameters
(see the illustrative sketches below). DICEPTION
provides valuable insights and a more promising solution for visual generalist
models.
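
The random-color encoding mentioned in the abstract can be made concrete with a short sketch. The function below is a hypothetical illustration, not the authors' code: it renders a stack of binary instance masks as a single RGB target image by sampling one random color per instance, the kind of target a conditional image generator could be trained to produce.

```python
import numpy as np

def color_encode(masks, seed=None):
    """Render (N, H, W) boolean instance masks as an (H, W, 3) RGB target,
    assigning each instance a random color (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n, h, w = masks.shape
    target = np.zeros((h, w, 3), dtype=np.uint8)
    colors = rng.integers(0, 256, size=(n, 3), dtype=np.uint8)
    for mask, color in zip(masks, colors):
        target[mask] = color  # later instances overwrite earlier ones on overlap
    return target

# Toy example: two instances on a 4x4 canvas.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2] = True   # top half is instance 0
masks[1, 2:] = True   # bottom half is instance 1
print(color_encode(masks, seed=0).shape)  # (4, 4, 3)
```

Because the colors carry no fixed class meaning, the same encoding serves both entity segmentation (arbitrary instances) and semantic segmentation (one color per category), which is what makes the random-assignment strategy attractive for a unified model.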
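The claim of adapting with roughly 1% of the parameters suggests a parameter-efficient fine-tuning setup. The PyTorch sketch below is an assumption, not the paper's actual recipe: the toy model, the `adapter` module name, and the optimizer settings are invented for the example. It shows the general pattern of freezing a pretrained backbone and training only a small trainable module.

```python
import torch
from torch import nn

class ToyModel(nn.Module):
    """Stand-in for a pretrained backbone plus a small adapter; in a real
    setting the adapter would hold roughly 1% of the parameters."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(512, 512)  # pretend this is pretrained
        self.adapter = nn.Linear(512, 512)   # small trainable module

    def forward(self, x):
        return self.adapter(self.backbone(x))

def freeze_all_but(model, trainable_keywords=("adapter",)):
    """Freeze every parameter whose name contains no trainable keyword."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable}/{total} parameters ({trainable/total:.1%})")
    return [p for p in model.parameters() if p.requires_grad]

model = ToyModel()
params = freeze_all_but(model)
optimizer = torch.optim.AdamW(params, lr=1e-4)  # updates only the adapter
```

With only the adapter receiving gradients, a handful of annotated images (the abstract cites as few as 50) can suffice to specialize the model without disturbing the pretrained weights.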