Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

Abstract

Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm to low-level visual perception and understanding tasks: a single model that can respond to a broad range of natural human instructions. While existing foundation models have shown exciting potential on low-level visual tasks, their related abilities are still preliminary and need to be improved. To enhance these models, we conduct a large-scale subjective experiment that collects a vast amount of real human feedback on low-level vision. Each feedback follows a pathway that starts with a detailed description of the low-level visual appearance (*e.g. clarity, color, brightness*) of an image and ends with an overall conclusion, with an average length of 45 words. The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on 18,973 images with diverse low-level appearance. Moreover, to enable foundation models to respond robustly to diverse types of questions, we design a GPT-participated conversion that processes these feedbacks into 200K instruction-response pairs in diverse formats. Experimental results indicate that **Q-Instruct** consistently elevates low-level perception and understanding abilities across several foundation models. We anticipate that our datasets can pave the way for a future in which general intelligence can perceive and understand low-level visual appearance and evaluate visual quality like a human. Our dataset, model zoo, and demo are published at: https://q-future.github.io/Q-Instruct.
