DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
February 24, 2025
Authors: Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
cs.AI
Abstract
Our primary goal here is to create a good, generalist perception model that
can tackle multiple tasks, within limits on computational resources and
training data. To achieve this, we resort to text-to-image diffusion models
pre-trained on billions of images. Our exhaustive evaluation metrics
demonstrate that DICEPTION effectively tackles multiple perception tasks,
achieving performance on par with state-of-the-art models. We achieve results
on par with SAM-vit-h using only 0.06% of its data (600K vs. 1B
pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates
the outputs of various perception tasks using color encoding, and we show that
the strategy of assigning random colors to different instances is highly
effective in both entity segmentation and semantic segmentation. Unifying
various perception tasks as conditional image generation enables us to fully
leverage pre-trained text-to-image models. Thus, DICEPTION can be trained
efficiently at a cost orders of magnitude lower than that of conventional
models trained from scratch. When adapting our model to other tasks, it only
requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION
provides valuable insights and a more promising solution for visual generalist
models. Our results suggest that DICEPTION can be a stepping stone towards
generalist models that can perform multiple perception tasks with minimal
computational resources and training data.
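The abstract's color-encoding idea (rendering segmentation outputs as images by assigning each instance a random color) can be sketched as follows. This is a minimal illustration, not DICEPTION's actual implementation; the function name `colorize_instances` and its interface are assumptions.

```python
import numpy as np

def colorize_instances(instance_masks, seed=None):
    """Render a list of binary instance masks as one RGB image by
    assigning each instance a random color, illustrating the strategy
    of encoding segmentation outputs as color-coded images."""
    rng = np.random.default_rng(seed)
    h, w = instance_masks[0].shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for mask in instance_masks:
        # One random RGB color per instance; unmasked pixels stay black.
        color = rng.integers(0, 256, size=3, dtype=np.uint8)
        canvas[mask.astype(bool)] = color
    return canvas
```

Encoding outputs this way lets a text-to-image diffusion model treat segmentation as ordinary conditional image generation, since the target is just an RGB image.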
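The abstract reports adaptation to new tasks by fine-tuning only about 1% of the model's parameters. One common way to achieve such a budget is a low-rank adapter on top of a frozen weight; the sketch below is a hypothetical NumPy illustration of that general technique, not DICEPTION's actual adaptation code.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update A @ B, so only
    a small fraction of parameters is tuned during adaptation."""
    def __init__(self, W, rank=4, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                    # frozen pre-trained weight
        self.A = rng.normal(0, 0.01, (d_out, rank))   # trainable
        self.B = np.zeros((rank, d_in))               # trainable, zero-init so
                                                      # the initial output is unchanged
    def __call__(self, x):
        return x @ (self.W + self.A @ self.B).T

    def trainable_fraction(self):
        trainable = self.A.size + self.B.size
        return trainable / (self.W.size + trainable)
```

With `rank` much smaller than the weight's dimensions, the trainable fraction stays in the low single-digit percents, consistent with the ~1% budget the abstract describes.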