Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
November 12, 2023
Authors: Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, Geng Xue, Wenxiu Sun, Qiong Yan, Weisi Lin
cs.AI
Abstract
Multi-modality foundation models, as represented by GPT-4V, have brought a
new paradigm for low-level visual perception and understanding tasks that can
respond to a broad range of natural human instructions within a single model. While
existing foundation models have shown exciting potential on low-level visual
tasks, their related abilities are still preliminary and need to be improved.
In order to enhance these models, we conduct a large-scale subjective
experiment collecting a large amount of real human feedback on low-level
vision. Each piece of feedback follows a pathway that starts with a detailed
description of the low-level visual appearance (*e.g.*, clarity, color, brightness)
of an image and ends with an overall conclusion, with an average length of 45 words.
The constructed **Q-Pathway** dataset includes 58K detailed human feedback entries on
18,973 images with diverse low-level appearances. Moreover, to enable foundation
models to robustly respond to diverse types of questions, we design a
GPT-participated conversion that processes this feedback into 200K
instruction-response pairs in diverse formats. Experimental results indicate that
**Q-Instruct** consistently elevates low-level perception and understanding
abilities across several foundation models. We anticipate that our datasets
can pave the way for a future in which general intelligence can perceive and
understand low-level visual appearance and evaluate visual quality like a
human. Our dataset, model zoo, and demo are published at:
https://q-future.github.io/Q-Instruct.
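
To make the two data formats described in the abstract more concrete, below is a minimal sketch (in Python) of what a single **Q-Pathway** feedback entry and one converted **Q-Instruct** instruction-response pair might look like. The field names (`image`, `feedback`, `instruction`, `response`) and the example contents are illustrative assumptions for this sketch, not the released schema.

```python
# Minimal sketch (not from the paper) of the two data formats described above.
# All field names and contents are assumptions for illustration only.

# One Q-Pathway entry: free-form human feedback on a single image that starts
# with descriptions of low-level attributes and ends with an overall conclusion.
q_pathway_entry = {
    "image": "images/000001.jpg",  # hypothetical image path
    "feedback": (
        "The image is slightly blurry and a bit underexposed, "
        "with dull colors and visible noise in the dark regions. "
        "Overall, the quality of the image is poor."
    ),
}

# One converted instruction-response pair, e.g. a question derived from the
# feedback above by the GPT-participated conversion.
q_instruct_pair = {
    "image": "images/000001.jpg",
    "instruction": "Is the image well-exposed?",
    "response": "No, the image is a bit underexposed.",
}

if __name__ == "__main__":
    # Show the two formats side by side.
    print(q_pathway_entry["feedback"])
    print(q_instruct_pair["instruction"], "->", q_instruct_pair["response"])
```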