DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

Abstract

Our primary goal is to build a strong generalist perception model that can tackle multiple tasks within limits on computational resources and training data. To achieve this, we build on text-to-image diffusion models pre-trained on billions of images. Our comprehensive evaluation demonstrates that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of its data (600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding, and we show that assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models. As a result, DICEPTION can be trained at a cost orders of magnitude lower than that of conventional models trained from scratch. Adapting our model to other tasks requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models.
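The random-color encoding mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal pure-Python illustration under the assumption that an instance map is a 2D grid of integer ids with 0 as background. The function names `encode_instances_as_colors` and `decode_colors_to_instances` are hypothetical. A real pipeline would decode colors produced by a generative model, which are noisy and would likely need clustering rather than the exact matching used here.

```python
import random


def encode_instances_as_colors(instance_map, seed=0):
    """Turn a 2D grid of instance ids into an RGB grid.

    Assumption: id 0 is background and is drawn black; every other id
    gets a distinct, randomly drawn RGB color.
    """
    rng = random.Random(seed)
    palette = {0: (0, 0, 0)}
    used = {(0, 0, 0)}
    for row in instance_map:
        for inst_id in row:
            if inst_id not in palette:
                color = (0, 0, 0)
                # Redraw until the color is not already taken.
                while color in used:
                    color = (rng.randrange(256),
                             rng.randrange(256),
                             rng.randrange(256))
                palette[inst_id] = color
                used.add(color)
    rgb = [[palette[inst_id] for inst_id in row] for row in instance_map]
    return rgb, palette


def decode_colors_to_instances(rgb):
    """Group pixels by exact color; labels follow first appearance.

    Black maps back to background (0). The recovered ids match the
    original ids only up to relabeling.
    """
    labels = {(0, 0, 0): 0}
    decoded = []
    for row in rgb:
        out_row = []
        for color in row:
            if color not in labels:
                labels[color] = len(labels)
            out_row.append(labels[color])
        decoded.append(out_row)
    return decoded


instance_map = [
    [0, 1, 1],
    [2, 2, 0],
]
rgb, palette = encode_instances_as_colors(instance_map)
decoded = decode_colors_to_instances(rgb)
```

Because the colors carry no semantic meaning, the same encoding works for any number of instances, which is one plausible reading of why a single image-generation output format can cover both entity and semantic segmentation.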
