Mario: Multimodal Graph Reasoning with Large Language Models

Abstract

Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning over such heterogeneous multimodal signals while preserving graph topology raises two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address them, we propose Mario, a unified framework that resolves both challenges simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. The first is a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. The second is a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
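
The two stages can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the loss form (an InfoNCE-style objective in which a node's positives are its own image and its neighbors' images), the softmax router, and every name below (graph_contrastive_loss, route_modalities, the view labels) are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def graph_contrastive_loss(text_emb, image_emb, neighbors, temperature=0.1):
    """Assumed topology-guided text-to-image contrastive loss.

    For node i, the positives are its own image and the images of its
    graph neighbors; every other image in the batch acts as a negative.
    """
    n = len(text_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(text_emb[i], image_emb[j]) / temperature for j in range(n)]
        positives = {i} | set(neighbors.get(i, ()))
        # Subtract the max logit before exponentiating for numerical stability.
        m = max(logits)
        log_all = m + math.log(sum(math.exp(x - m) for x in logits))
        log_pos = m + math.log(sum(math.exp(logits[j] - m) for j in positives))
        total += log_all - log_pos
    return total / n


def route_modalities(view_scores):
    """Assumed router: softmax over per-view scores, then pick the best view.

    In a trained system the scores would come from a learned module; here
    they are passed in directly.
    """
    m = max(view_scores.values())
    exps = {view: math.exp(s - m) for view, s in view_scores.items()}
    z = sum(exps.values())
    probs = {view: e / z for view, e in exps.items()}
    return max(probs, key=probs.get), probs


# Toy graph: nodes 0 and 1 are linked, node 2 is isolated.
text_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
image_emb = [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0]]
neighbors = {0: [1], 1: [0], 2: []}

print(graph_contrastive_loss(text_emb, image_emb, neighbors))
print(route_modalities({"text": 0.2, "image": 1.5, "text+image": 0.9}))
```

Under these assumptions, the loss shrinks as each node's text embedding moves closer to its own image and its neighbors' images than to unrelated images, which is one way graph topology could guide cross-modal alignment. The router only shows the selection step; how Mario scores and presents modality configurations to the LLM is not specified in the abstract.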
