Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Abstract

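Most current large multimodal models (LMMs) can understand photos of natural scenes and portraits, but their understanding of abstract images such as charts, maps, and layouts, and their visual reasoning over them, remains rudimentary. These models often struggle with simple everyday tasks such as reading the time from a clock, understanding a flowchart, or planning a route on a road map. To address this, we design a multimodal self-instruct method that uses large language models and their code generation capabilities to synthesize a large number of abstract images and visual reasoning instructions across everyday scenarios. Our strategy readily produces a multimodal benchmark of 11,193 instructions covering eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. Although it is built from simple lines and geometric elements, the benchmark exposes the limitations of state-of-the-art LMMs such as Claude-3.5-Sonnet and GPT-4o in abstract image understanding, spatial relation reasoning, and visual element induction. In addition, to verify the quality of the synthetic data, we fine-tune an LMM on 62,476 synthetic chart, table, and road map instructions. The fine-tuned model shows improved chart understanding and map navigation, with potential benefits for other visual reasoning tasks. Our code is available at https://github.com/zwq2018/Multi-modal-Self-instruct.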

English

Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images (e.g., charts, maps, or layouts) and their visual reasoning capabilities remain quite rudimentary. They often struggle with simple daily tasks, such as reading the time from a clock, understanding a flowchart, or planning a route using a road map. In light of this, we design a multimodal self-instruct strategy, utilizing large language models and their code capabilities to synthesize a large number of abstract images and visual reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark with 11,193 instructions for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark, constructed with simple lines and geometric elements, exposes the shortcomings of even the most advanced LMMs, such as Claude-3.5-Sonnet and GPT-4o, in abstract image understanding, spatial relation reasoning, and visual element induction. In addition, to verify the quality of our synthetic data, we fine-tune an LMM using 62,476 synthetic chart, table, and road map instructions. The results demonstrate improved chart understanding and map navigation performance, and suggest potential benefits for other visual reasoning tasks. Our code is available at: https://github.com/zwq2018/Multi-modal-Self-instruct.
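As a rough illustration of how code can synthesize an abstract image together with a verifiable instruction, the sketch below draws an analog clock as SVG and derives a question-answer pair from the same underlying values. It is a minimal sketch and not the authors' pipeline: the function names `render_clock_svg` and `make_clock_sample`, the SVG output format, and the use of a random number generator in place of an LLM proposing the content are all assumptions made for illustration.

```python
import math
import random


def render_clock_svg(hour: int, minute: int, size: int = 200) -> str:
    """Draw a minimal analog clock face as an SVG string (hypothetical helper)."""
    c = size / 2   # centre of the dial
    r = c * 0.9    # dial radius

    def line(angle_deg: float, start: float, end: float, width: int) -> str:
        # 0 degrees points at 12 o'clock; angles grow clockwise.
        rad = math.radians(angle_deg)
        x1, y1 = c + start * math.sin(rad), c - start * math.cos(rad)
        x2, y2 = c + end * math.sin(rad), c - end * math.cos(rad)
        return (f'<line x1="{x1:.1f}" y1="{y1:.1f}" x2="{x2:.1f}" y2="{y2:.1f}" '
                f'stroke="black" stroke-width="{width}"/>')

    # Twelve hour ticks around the rim.
    ticks = [line(i * 30, 0.85 * r, r, 2) for i in range(12)]

    minute_angle = minute * 6.0                     # 360 degrees / 60 minutes
    hour_angle = (hour % 12) * 30.0 + minute * 0.5  # hour hand drifts with minutes

    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">',
        f'<circle cx="{c:.1f}" cy="{c:.1f}" r="{r:.1f}" fill="white" '
        f'stroke="black" stroke-width="2"/>',
        *ticks,
        line(hour_angle, 0, 0.5 * r, 4),    # short, thick hour hand
        line(minute_angle, 0, 0.8 * r, 2),  # long, thin minute hand
        "</svg>",
    ]
    return "\n".join(parts)


def make_clock_sample(rng: random.Random) -> dict:
    """Build one image-instruction-answer triple from randomly chosen values.

    Assumption: a random generator stands in for the LLM that would
    propose the content in a self-instruct setting.
    """
    hour = rng.randint(1, 12)
    minute = rng.randrange(0, 60, 5)
    return {
        "image_svg": render_clock_svg(hour, minute),
        "instruction": "What time does the clock show?",
        "answer": f"{hour}:{minute:02d}",
    }


if __name__ == "__main__":
    sample = make_clock_sample(random.Random(0))
    print(sample["instruction"], "->", sample["answer"])
```

Because the image and the answer are generated from the same values, the answer is correct by construction, which is the property that makes code-driven synthesis attractive for building benchmarks of this kind.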


Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
