

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

July 9, 2024
作者: Wenqi Zhang, Zhenglin Cheng, Yuanyu He, Mengna Wang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Mingqian He, Yanna Ma, Weiming Lu, Yueting Zhuang
cs.AI

Abstract

Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or layouts, and their visual reasoning capabilities remain quite rudimentary. They often struggle with simple daily tasks, such as reading the time from a clock, understanding a flowchart, or planning a route with a road map. In light of this, we design a multimodal self-instruct method that uses large language models and their code capabilities to synthesize a large volume of abstract images and visual reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark of 11,193 instructions covering eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark, constructed from simple lines and geometric elements, exposes the shortcomings of even the most advanced LMMs, such as Claude-3.5-Sonnet and GPT-4o, in abstract image understanding, spatial relation reasoning, and visual element induction. In addition, to verify the quality of our synthetic data, we fine-tune an LMM on 62,476 synthetic chart, table, and road map instructions. The results show improved chart understanding and map navigation performance, and also suggest potential benefits for other visual reasoning tasks. Our code is available at: https://github.com/zwq2018/Multi-modal-Self-instruct.
