
TextSquare: Scaling up Text-Centric Visual Instruction Tuning

April 19, 2024
作者: Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang
cs.AI

Abstract

Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses the previous open-source state of the art among text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini on 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a vivid pattern: model performance improves in proportion to the exponential growth of instruction-tuning data volume, validating both the scale and the high quality of Square-10M.
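The four-step Square pipeline described above can be sketched as a loop of prompts to a closed-source MLLM. This is a minimal illustrative sketch, not the paper's implementation: `query_mllm` is a hypothetical stub standing in for an API call, and all function and field names here are assumptions.

```python
# Hypothetical sketch of the four-stage Square data-construction pipeline
# (Self-Questioning, Answering, Reasoning, Evaluation). `query_mllm` is a
# stub for a closed-source MLLM call; names are illustrative only.

def query_mllm(prompt: str, image: str) -> str:
    """Stub standing in for an API request to a closed-source MLLM."""
    return f"response[{image}]: {prompt[:40]}"

def square(image: str, num_questions: int = 3) -> list[dict]:
    examples = []
    for i in range(num_questions):
        # 1) Self-Questioning: the MLLM proposes a text-centric question.
        question = query_mllm(f"Propose text-centric question #{i}.", image)
        # 2) Answering: the MLLM answers its own question.
        answer = query_mllm(f"Answer this question: {question}", image)
        # 3) Reasoning: the MLLM explains the context behind the answer.
        reasoning = query_mllm(f"Explain the reasoning for: {answer}", image)
        # 4) Evaluation: the MLLM judges the QA pair; in the real pipeline,
        # pairs judged unreliable would be filtered out of the dataset.
        verdict = query_mllm(f"Is this QA pair correct? Q: {question}", image)
        examples.append({"question": question, "answer": answer,
                         "reasoning": reasoning, "evaluation": verdict})
    return examples

data = square("doc_page.png")
print(len(data))  # 3
```

Keeping the Reasoning output alongside each QA pair is what the abstract credits for the accuracy and hallucination gains: the tuned model sees not just answers but the context that justifies them.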

