

LaViDa: A Large Diffusion Language Model for Multimodal Understanding

May 22, 2025
作者: Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover
cs.AI

Abstract

Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text infilling. While DMs are effective in language-only settings, their potential for multimodal tasks remains underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined components for multimodal instruction following. To address the challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves performance competitive with or superior to AR VLMs on multimodal benchmarks such as MMMU, while offering the unique advantages of DMs, including a flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with a 1.92x speedup. On bidirectional tasks, it achieves a +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
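The complementary masking mentioned in the abstract can be illustrated with a minimal sketch: in masked-diffusion training, only masked positions receive a loss signal, so building two training views whose masks are exact complements ensures every (non-prefix) token is supervised in exactly one view. The helper below is a hypothetical illustration of that idea, not the paper's implementation; the function name, `mask_ratio` parameter, and the choice to exempt a `prefix_len` of image/prompt tokens from masking are all assumptions.

```python
import random


def complementary_masked_views(tokens, mask_token_id, mask_ratio=0.5,
                               prefix_len=0, rng=None):
    """Build two masked training views with complementary masks.

    Positions below `prefix_len` (e.g. image and prompt tokens) are never
    masked; every remaining position is masked in exactly one of the two
    views, so the full sequence contributes to the training loss.
    Hypothetical sketch of complementary masking, not LaViDa's actual code.
    """
    rng = rng or random.Random()
    seq_len = len(tokens)

    # Randomly assign each maskable position to view A with prob mask_ratio;
    # view B masks the complement of view A's mask over the maskable region.
    in_a = [i >= prefix_len and rng.random() < mask_ratio
            for i in range(seq_len)]
    in_b = [i >= prefix_len and not in_a[i] for i in range(seq_len)]

    view_a = [mask_token_id if m else t for t, m in zip(tokens, in_a)]
    view_b = [mask_token_id if m else t for t, m in zip(tokens, in_b)]
    return (view_a, in_a), (view_b, in_b)
```

In this sketch, a training step would compute the denoising loss on both views and average them, so no token's supervision signal is dropped by the random mask draw.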
