TextSquare: Scaling up Text-Centric Visual Instruction Tuning

April 19, 2024
作者: Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang
cs.AI

Abstract

Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini on 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions; this not only improves accuracy but also significantly mitigates hallucination. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling the text-centric VQA data reveals a clear pattern: model performance improves in direct proportion to exponential growth in instruction-tuning data volume, validating the necessity of both the scale and the high quality of Square-10M.
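The abstract only names the four Square steps; the sketch below shows one way such a closed-source-MLLM data pipeline could be wired together. It is a minimal sketch, not the authors' implementation: the `ask_mllm` client, the `VQASample` container, and all prompt wording are hypothetical placeholders.

```python
# Hypothetical sketch of the four Square steps named in the abstract:
# Self-Questioning, Answering, Reasoning, Evaluation.
from dataclasses import dataclass

@dataclass
class VQASample:
    image_path: str
    question: str
    answer: str
    reasoning: str

def ask_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a closed-source MLLM chat API."""
    raise NotImplementedError

def square(image_path: str, num_questions: int = 3) -> list[VQASample]:
    samples: list[VQASample] = []
    # Step 1: Self-Questioning -- the MLLM proposes text-centric questions.
    q_text = ask_mllm(
        image_path,
        f"Propose {num_questions} questions about the text in this image, one per line.",
    )
    for question in filter(None, (q.strip() for q in q_text.splitlines())):
        # Step 2: Answering -- the MLLM answers its own question.
        answer = ask_mllm(image_path, f"Answer concisely: {question}")
        # Step 3: Reasoning -- the MLLM explains the contextual evidence,
        # the data the paper credits with reducing hallucination.
        reasoning = ask_mllm(
            image_path,
            f"Citing the text in the image, explain why '{answer}' answers '{question}'.",
        )
        # Step 4: Evaluation -- the MLLM judges the pair; keep only accepted ones.
        verdict = ask_mllm(
            image_path,
            f"Is '{answer}' a correct answer to '{question}'? Reply yes or no.",
        )
        if verdict.strip().lower().startswith("yes"):
            samples.append(VQASample(image_path, question, answer, reasoning))
    return samples
```

Read literally, finding 3 suggests roughly logarithmic scaling: performance grows about linearly in log N, where N is the number of instruction-tuning samples. This is an interpretation of the abstract's phrasing, not a formula from the paper.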
