ChatPaper.ai

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

June 29, 2023
作者: Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun
cs.AI

Abstract

Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
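The abstract describes prompting text-only GPT-4 with OCR results and image captions to generate question-answer conversations about text-rich images. A minimal sketch of how such a prompt could be composed is shown below; the function name, wording, and parameters are illustrative assumptions, not the authors' actual prompt template.

```python
# Hypothetical sketch of the data-generation step described in the abstract:
# OCR-recognized text plus an image caption stand in for the image itself,
# and a text-only LLM is asked to produce grounded QA pairs about it.
# The prompt wording here is an assumption, not taken from the paper.

def build_qa_prompt(ocr_text: str, caption: str, num_pairs: int = 3) -> str:
    """Compose a text-only prompt asking an LLM to generate QA pairs
    about an image it cannot see, using OCR text and a caption as proxy."""
    return (
        "You are given metadata for an image you cannot see.\n"
        f"Caption: {caption}\n"
        f"Text recognized by OCR: {ocr_text}\n"
        f"Generate {num_pairs} question-answer pairs a user might ask about "
        "the textual content of this image. Ground every answer in the OCR "
        "text above."
    )

if __name__ == "__main__":
    prompt = build_qa_prompt(
        ocr_text="GRAND OPENING  SAT JUNE 10  50% OFF",
        caption="A red storefront poster announcing a sale.",
    )
    print(prompt)
```

The resulting prompt string would then be sent to a text-only LLM; the model's JSON or free-form QA output forms one instruction-following conversation per image.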