Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

March 10, 2025
作者: Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu
cs.AI

Abstract

Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.
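The abstract describes DiagNote as two interacting modules: Deliberate, which produces Chain-of-Thought reasoning, and Gaze, which annotates image regions across the turns of a dialogue. The sketch below illustrates one plausible shape of that interaction loop; all class and method names (`Deliberate.reason`, `Gaze.annotate`, `Turn`) are hypothetical placeholders, not the paper's actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One dialogue turn: question, reasoning trace, and grounded regions."""
    question: str
    reasoning: str                       # Chain-of-Thought text from Deliberate
    regions: list[tuple[int, int, int, int]]  # bounding boxes from Gaze


class Deliberate:
    """Hypothetical reasoning module (placeholder for an MLLM call)."""

    def reason(self, question: str, history: list[Turn],
               regions: list[tuple[int, int, int, int]]) -> str:
        # A real module would condition on the image, the dialogue history,
        # and the currently grounded regions.
        return f"step: answer {question!r} using {len(regions)} region(s)"


class Gaze:
    """Hypothetical grounding module: marks image regions relevant to the reasoning."""

    def annotate(self, question: str, reasoning: str) -> list[tuple[int, int, int, int]]:
        # Placeholder bounding box (x1, y1, x2, y2) in pixel coordinates.
        return [(0, 0, 64, 64)]


def run_dialogue(questions: list[str]) -> list[Turn]:
    """Alternate Deliberate and Gaze once per turn, carrying history forward."""
    deliberate, gaze = Deliberate(), Gaze()
    history: list[Turn] = []
    for q in questions:
        draft = deliberate.reason(q, history, [])      # initial, ungrounded reasoning
        regions = gaze.annotate(q, draft)              # Gaze grounds the draft
        refined = deliberate.reason(q, history, regions)  # Deliberate refines with grounding
        history.append(Turn(q, refined, regions))
    return history


turns = run_dialogue(["What is on the table?", "What color is it?"])
```

The loop is the point of the sketch: each turn's grounded regions and reasoning enter the shared history, so later questions (e.g. "What color is it?") can resolve references against earlier turns — the multi-turn correlation MMDiag is built to test.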
