InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Abstract

We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion can handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement). It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in the field of computer vision.
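To make the idea of a pixel-space output concrete, the following is a minimal sketch of how understanding-task annotations might be rendered as target images paired with instructions. It is an illustration only, not the authors' data pipeline: the helper names (`blank_image`, `apply_mask_overlay`, `draw_circle`, `make_example`), the 50% blend factor, the circle radius, and the use of plain Python lists of RGB tuples are all assumptions chosen to keep the example dependency-free. A real pipeline would more likely use an array or imaging library.

```python
import math

def blank_image(width, height, value=200):
    """Create a uniform gray RGB image as rows of (r, g, b) tuples."""
    return [[(value, value, value) for _ in range(width)] for _ in range(height)]

def apply_mask_overlay(image, mask, color=(0, 0, 255), alpha=0.5):
    """Blend `color` over pixels where `mask` is True; other pixels are copied.

    The blended channel values are truncated to integers.
    """
    out = []
    for row, mask_row in zip(image, mask):
        new_row = []
        for pixel, inside in zip(row, mask_row):
            if inside:
                pixel = tuple(
                    int((1 - alpha) * p + alpha * c) for p, c in zip(pixel, color)
                )
            new_row.append(pixel)
        out.append(new_row)
    return out

def draw_circle(image, center_xy, radius=3, thickness=1, color=(255, 0, 0)):
    """Draw a ring of the given radius around a keypoint given as (x, y)."""
    cx, cy = center_xy
    out = [list(row) for row in image]  # copy rows so the input stays unchanged
    for y, row in enumerate(out):
        for x in range(len(row)):
            if abs(math.hypot(x - cx, y - cy) - radius) <= thickness / 2:
                row[x] = color
    return out

def make_example(source, instruction, target):
    """Bundle one (source image, instruction, target image) training triple."""
    return {"source": source, "instruction": instruction, "target": target}

# Hypothetical annotations: a rectangular "car" mask and a "shoulder" keypoint.
image = blank_image(32, 32)
car_mask = [[8 <= y < 16 and 4 <= x < 12 for x in range(32)] for y in range(32)]

seg = make_example(
    image,
    "apply a blue mask to the left car",
    apply_mask_overlay(image, car_mask),
)
kp = make_example(
    image,
    "encircle the man's left shoulder in red",
    draw_circle(image, (20, 10)),
)
```

In this framing, segmentation and keypoint detection share one output format: both targets are ordinary images, so a single instruction-conditioned image model could in principle be trained on either. Recovering a mask or coordinate from the predicted pixels would need a separate post-processing step, which this sketch does not cover.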
