InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Abstract

We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion can handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement). It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in the field of computer vision.
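To make the pixel-space output idea concrete, the sketch below shows how a keypoint could be expressed as a red ring drawn on the image and then read back as coordinates. It is only an illustration under stated assumptions, not the authors' implementation: the diffusion model itself is omitted, and the function names (`encode_keypoint`, `decode_keypoint`), the ring thickness, and the color tolerance are all hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]

RED: Pixel = (255, 0, 0)


def blank_image(height: int, width: int, color: Pixel = (128, 128, 128)) -> Image:
    """Create a uniform image stored as rows of RGB tuples."""
    return [[color for _ in range(width)] for _ in range(height)]


def encode_keypoint(image: Image, center: Tuple[int, int],
                    radius: int = 4, color: Pixel = RED) -> Image:
    """Return a copy of `image` with a ring of `color` around `center` (row, col).

    This plays the role of the target picture for an instruction such as
    "encircle the left shoulder in red" (hypothetical encoding).
    """
    cy, cx = center
    out = [row[:] for row in image]
    for y in range(len(out)):
        for x in range(len(out[0])):
            d2 = (y - cy) ** 2 + (x - cx) ** 2
            # Keep pixels whose distance to the center is within 1 of `radius`.
            if (radius - 1) ** 2 <= d2 <= (radius + 1) ** 2:
                out[y][x] = color
    return out


def decode_keypoint(image: Image, color: Pixel = RED,
                    tol: int = 30) -> Optional[Tuple[float, float]]:
    """Recover (row, col) as the centroid of pixels close to `color`.

    Returns None when no pixel matches, i.e. no marker was drawn.
    """
    ys, xs = [], []
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if all(abs(a - b) <= tol for a, b in zip(pixel, color)):
                ys.append(y)
                xs.append(x)
    if not ys:
        return None
    return (sum(ys) / len(ys), sum(xs) / len(xs))


source = blank_image(32, 32)
target = encode_keypoint(source, center=(10, 12))
recovered = decode_keypoint(target)
```

Because the ring is symmetric around its center and lies fully inside the image, the centroid of the red pixels coincides with the encoded keypoint. In a full system the target image would be predicted by the model from the source image and the instruction, and a decoding step of this kind would turn the predicted pixels back into coordinates.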
