SVIT: Scaling up Visual Instruction Tuning

Abstract

Thanks to the emergence of foundation models, large language and vision models have been integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models show impressive performance in visual understanding and reasoning, their limits remain largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning data, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Besides its volume, the proposed dataset features high quality and rich diversity, as it is generated by prompting GPT-4 with the abundant manual annotations of images. We empirically verify that training multimodal models on SVIT can significantly improve multimodal performance in terms of visual perception, reasoning, and planning.
