
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

September 7, 2023
作者: Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo
cs.AI

Abstract

We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion could handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement). It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in the field of computer vision.
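To make the instruction-driven, pixel-space interface concrete, the minimal sketch below runs an instruction-conditioned diffusion editing pipeline. It uses the publicly available InstructPix2Pix pipeline from Hugging Face diffusers as a stand-in for the kind of model the abstract describes; the checkpoint name, input file, and prompt are illustrative assumptions and do not reflect the authors' released code.

```python
# Illustrative sketch only: not the InstructDiffusion model itself.
# An instruction-conditioned diffusion pipeline (InstructPix2Pix) is used as a
# stand-in for a model that "predicts pixels according to user instructions".
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Load a pretrained instruction-following image-editing pipeline (assumed checkpoint).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("street.jpg").convert("RGB")  # hypothetical input image

# A vision task phrased as an editing instruction: segmentation rendered as a
# colored mask directly in pixel space.
edited = pipe(
    prompt="apply a blue mask to the left car",
    image=image,
    num_inference_steps=50,
    image_guidance_scale=1.5,  # how closely to follow the input image
    guidance_scale=7.5,        # how closely to follow the instruction
).images[0]

edited.save("left_car_masked.png")
```

Under such an interface, a recognition task like segmentation is expressed as an editing instruction whose output is an ordinary image, which is the unification the paper argues for.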