Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

Abstract

Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm to low-level visual perception and understanding tasks: a single model that can respond to a broad range of natural human instructions. While existing foundation models have shown exciting potential on low-level visual tasks, their related abilities are still preliminary and need to be improved. To enhance these models, we conduct a large-scale subjective experiment that collects a vast amount of real human feedback on low-level vision. Each piece of feedback follows a pathway that starts with a detailed description of the low-level visual appearance (*e.g.* clarity, color, brightness) of an image and ends with an overall conclusion, with an average length of 45 words. The constructed **Q-Pathway** dataset includes 58K detailed human feedback entries on 18,973 images with diverse low-level appearance. Moreover, to enable foundation models to respond robustly to diverse types of questions, we design a GPT-participated conversion that processes this feedback into 200K instruction-response pairs in diverse formats. Experimental results indicate that **Q-Instruct** consistently elevates low-level perception and understanding abilities across several foundation models. We anticipate that our datasets can pave the way for a future in which general intelligence can perceive and understand low-level visual appearance and evaluate visual quality like a human. Our dataset, model zoo, and demo are published at: https://q-future.github.io/Q-Instruct.
