LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Abstract

Instruction tuning unlocks the capability of Large Language Models (LLMs) to interact with humans. Recent instruction-following datasets also include images as visual inputs and collect responses to image-based instructions. However, visual instruction-tuned models still struggle to comprehend textual details within images. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters and book covers). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. We then prompt text-only GPT-4 with the recognized text and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining the collected data with previous multimodal instruction-following data, our model, LLaVAR, substantially improves on the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. In qualitative analysis, LLaVAR shows promising skills for interacting with humans (e.g., reasoning, writing, and elaboration) based on the latest real-world online content that combines text and images. We make our code, data, and models publicly available at https://llavar.github.io/.
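
The data-generation step described in the abstract (recognized text plus an image caption fed to a text-only model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TextRichImage` record, the `build_prompt` helper, the `max_ocr_chars` limit, and the prompt wording are hypothetical and are not taken from the released code. The sketch only assembles the prompt string; it does not call any OCR tool or GPT-4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRichImage:
    """One image record. All field names are hypothetical."""
    image_id: str
    caption: str           # caption paired with the image
    ocr_texts: List[str]   # strings recognized by an OCR tool


def build_prompt(record: TextRichImage, max_ocr_chars: int = 500) -> str:
    """Assemble a text-only prompt from a caption and OCR output.

    The instruction wording is an assumption, not the authors' prompt.
    """
    # Drop empty OCR fragments and join the rest into a single line.
    ocr = " ".join(t.strip() for t in record.ocr_texts if t.strip())
    # Truncate very long OCR output so the prompt stays bounded.
    ocr = ocr[:max_ocr_chars]
    return (
        "You are given the caption and OCR text of an image you cannot see.\n"
        f"Caption: {record.caption}\n"
        f"OCR text: {ocr}\n"
        "Write question-answer pairs about the text in the image."
    )


# Made-up example record, for illustration only.
example = TextRichImage(
    image_id="poster-001",
    caption="A movie poster",
    ocr_texts=["THE LAST VOYAGE", "  ", "In theaters July 4"],
)
prompt = build_prompt(example)
print(prompt)
```

In a pipeline of this shape, the returned string would be sent to a text-only language model, and the generated question-answer pairs would be stored alongside the image as one conversation.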

