
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

June 29, 2023
作者: Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun
cs.AI

Abstract

Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
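The data-collection step described above (OCR on text-rich LAION images, then prompting text-only GPT-4 with the recognized text and caption to produce question-answer conversations) can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the specific OCR tool (`pytesseract`), the OpenAI client usage, and the prompt wording are all assumptions for the sake of the example.

```python
def build_gpt4_prompt(ocr_text: str, caption: str) -> str:
    """Combine OCR output and the image caption into a single prompt for
    text-only GPT-4, asking it to generate Q&A pairs about the image's text.
    The prompt wording here is illustrative, not the paper's actual prompt."""
    return (
        "Below are the caption and the OCR-recognized text of an image.\n"
        f"Caption: {caption}\n"
        f"OCR text: {ocr_text}\n"
        "Generate question-answer pairs a user might ask about the textual "
        "content of this image, in a conversational format."
    )

# How the pieces might fit together (not executed here; requires
# pytesseract, Pillow, the OpenAI SDK, and an API key):
#
#   import pytesseract
#   from PIL import Image
#   from openai import OpenAI
#
#   ocr_text = pytesseract.image_to_string(Image.open("poster.jpg"))
#   prompt = build_gpt4_prompt(ocr_text, caption="A movie poster")
#   client = OpenAI()
#   resp = client.chat.completions.create(
#       model="gpt-4",
#       messages=[{"role": "user", "content": prompt}],
#   )
#   print(resp.choices[0].message.content)
```

At scale, the same prompt-construction step would simply be mapped over the 422K OCR results, with the generated conversations filtered and merged into the existing multi-modal instruction-following data.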