Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
November 12, 2023
Authors: Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, Geng Xue, Wenxiu Sun, Qiong Yan, Weisi Lin
cs.AI
Abstract
Multi-modality foundation models, as represented by GPT-4V, have brought a
new paradigm for low-level visual perception and understanding tasks,
responding to a broad range of natural human instructions within a single
model. While existing foundation models have shown exciting potential on
low-level visual tasks, their related abilities are still preliminary and need
to be improved. In order to enhance these models, we conduct a large-scale
subjective experiment collecting a vast amount of real human feedback on
low-level vision. Each feedback entry follows a pathway that starts with a
detailed description of the low-level visual appearance (*e.g., clarity,
color, brightness*) of an image, and ends with an overall conclusion, with an
average length of 45 words. The constructed **Q-Pathway** dataset includes 58K
detailed human feedback entries on 18,973 images with diverse low-level
appearance. Moreover, to enable foundation models to robustly respond to
diverse types of questions, we design a GPT-participated conversion to process
this feedback into 200K instruction-response pairs in diverse formats, which
constitute the **Q-Instruct** dataset. Experimental results indicate that
**Q-Instruct** consistently elevates low-level perception and understanding
abilities across several foundation models. We anticipate that our datasets
can pave the way for a future in which general intelligence can perceive and
understand low-level visual appearance and evaluate visual quality like a
human. Our dataset, model zoo, and demo are published at:
https://q-future.github.io/Q-Instruct.
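
The abstract does not detail how the GPT-participated conversion works. As a rough illustration only, the minimal sketch below shows how one Q-Pathway feedback entry might be rewritten into instruction-response pairs via the OpenAI chat API; the model name, prompt wording, and output format are assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: convert one Q-Pathway feedback entry (detailed
# description + overall conclusion) into question-answer pairs suitable for
# instruction tuning. Prompt, model choice, and JSON schema are illustrative
# assumptions, not the paper's exact method.
import json
from openai import OpenAI  # official openai Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are given a human-written assessment of an image's low-level visual "
    "appearance (clarity, color, brightness, etc.). Rewrite it into three "
    "diverse question-answer pairs about the image's low-level quality. "
    "Return a JSON list of objects with 'question' and 'answer' fields."
)

def pathway_to_instruction_pairs(feedback: str) -> list[dict]:
    """Turn a single Q-Pathway feedback string into instruction-response pairs."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed; the paper only says the conversion is GPT-participated
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": feedback},
        ],
    )
    # The model is asked for strict JSON; a production pipeline would validate this.
    return json.loads(response.choices[0].message.content)

# Example usage with a made-up feedback string:
# pairs = pathway_to_instruction_pairs(
#     "The image is slightly out of focus and the colors look washed out, "
#     "but the brightness is adequate. Overall, the quality is below average."
# )
```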