

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

July 9, 2024
作者: Wenqi Zhang, Zhenglin Cheng, Yuanyu He, Mengna Wang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Mingqian He, Yanna Ma, Weiming Lu, Yueting Zhuang
cs.AI

Abstract

Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images (e.g., charts, maps, or layouts) and their visual reasoning capabilities remain quite rudimentary. They often struggle with simple daily tasks, such as reading the time from a clock, understanding a flowchart, or planning a route using a road map. In light of this, we design a multimodal self-instruct strategy that uses large language models and their code capabilities to synthesize massive abstract images and visual reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark with 11,193 instructions for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark, constructed with simple lines and geometric elements, exposes the shortcomings of most advanced LMMs, such as Claude-3.5-Sonnet and GPT-4o, in abstract image understanding, spatial relation reasoning, and visual element induction. In addition, to verify the quality of our synthetic data, we fine-tune an LMM using 62,476 synthetic chart, table, and road map instructions. The results show improved chart understanding and map navigation performance, and also suggest potential benefits for other visual reasoning tasks. Our code is available at: https://github.com/zwq2018/Multi-modal-Self-instruct.
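To make the synthesis idea concrete, here is a minimal sketch of the kind of image-plus-instruction pair the abstract describes, assuming a Python/matplotlib setting. The function name `synthesize_chart_instruction`, the bar-chart scenario, and the output fields are illustrative assumptions, not taken from the paper's repository; in the paper's pipeline the plotting code itself is written by a language model.

```python
# Minimal sketch (not the paper's actual pipeline): render an abstract
# image from sampled data, then derive an instruction/answer pair from
# the same data so the answer is correct by construction.
import json
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def synthesize_chart_instruction(image_path: str) -> dict:
    # 1. Sample the ground-truth data the image will depict.
    categories = ["Q1", "Q2", "Q3", "Q4"]
    values = [random.randint(10, 100) for _ in categories]

    # 2. Render the abstract image (a simple bar chart) with plotting code.
    #    In the paper's setting, an LLM would generate this code itself.
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(categories, values, color="steelblue")
    ax.set_title("Quarterly Sales")
    ax.set_ylabel("Units")
    fig.savefig(image_path, dpi=150, bbox_inches="tight")
    plt.close(fig)

    # 3. Derive a visual-reasoning instruction whose answer is grounded
    #    in the sampled data, not in the rendered pixels.
    best = categories[values.index(max(values))]
    return {
        "image": image_path,
        "instruction": "Which quarter has the highest sales in the chart?",
        "answer": best,
        "values": dict(zip(categories, values)),
    }


if __name__ == "__main__":
    sample = synthesize_chart_instruction("chart_0001.png")
    print(json.dumps(sample, indent=2))
```

Because the instruction and answer are computed from the same data that produced the image, this style of synthesis scales to large instruction sets (such as the 11,193-instruction benchmark and the 62,476 fine-tuning examples mentioned above) without manual annotation.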
