Mario: Multimodal Graph Reasoning with Large Language Models

Abstract

Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning over such heterogeneous multimodal signals while preserving graph topology raises two key challenges: weak cross-modal consistency and heterogeneous modality preference. To address both, we propose Mario, a unified framework that resolves the two challenges simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two stages. The first is a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. The second is a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and uses a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks show that Mario consistently outperforms state-of-the-art graph models on node classification and link prediction in both supervised and zero-shot settings. The code will be made available at https://github.com/sunyuanfu/Mario.
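The two stages named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract gives no equations, so the InfoNCE-style loss, the linear router, and every name below (`graph_contrastive_loss`, `route`, `MODALITY_VIEWS`, the `neighbors` dictionary, the temperature value) are assumptions made for illustration. The first function treats a node's own image and the images of its graph neighbors as positives for its text embedding; the second scores three hypothetical modality configurations and picks the highest-scoring one.

```python
import math

# Hypothetical modality configurations a router could choose between.
MODALITY_VIEWS = ("text", "image", "text+image")


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_contrastive_loss(text_emb, image_emb, neighbors, temperature=0.1):
    """Assumed InfoNCE-style loss with topology-defined positives.

    For node i, the positives are the image embeddings of i itself and
    of its graph neighbors; all other nodes act as negatives.
    """
    n = len(text_emb)
    total = 0.0
    for i in range(n):
        logits = [dot(text_emb[i], image_emb[j]) / temperature for j in range(n)]
        probs = softmax(logits)
        positives = {i} | set(neighbors.get(i, ()))
        total += -math.log(sum(probs[j] for j in positives))
    return total / n


def route(node_features, router_weights):
    """Assumed linear router: one weight vector per modality view.

    Returns the selected view name and the full probability vector.
    """
    scores = [dot(w, node_features) for w in router_weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return MODALITY_VIEWS[best], probs


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    image = [[1.0, 0.0], [0.0, 1.0]]
    print(graph_contrastive_loss(text, image, neighbors={}))
    print(route([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]))
```

In this toy setting the loss is small when each text embedding aligns with its own image and grows when the pairing is swapped. A trained system would learn the router weights jointly with instruction tuning rather than fixing them by hand.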
