Spatial Knowledge Graph-Guided Multimodal Synthesis

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, spatial perception remains a notable limitation. Multimodal data synthesis offers a promising way to address this challenge, yet ensuring that synthesized data adhere to spatial common sense is non-trivial. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs and grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, and then uses the SKG to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
