LaViDa: A Large Diffusion Language Model for Multimodal Understanding

May 22, 2025
作者: Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover
cs.AI

Abstract
Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance relative to AR VLMs on multimodal benchmarks such as MMMU, while offering the unique advantages of DMs, including a flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with a 1.92x speedup. On bidirectional tasks, it achieves a +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
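The complementary masking mentioned above can be illustrated with a minimal sketch: a token sequence is masked twice with complementary position sets, so every token is hidden (and thus contributes a training signal) in exactly one of the two views. The function name, `MASK_ID`, and all details below are illustrative assumptions, not the paper's implementation.

```python
import random

MASK_ID = -1  # hypothetical id for the diffusion [MASK] token


def complementary_masking(tokens, mask_ratio=0.5, seed=0):
    """Return two masked views of `tokens` with complementary masked
    positions, so each position is masked in exactly one view."""
    rng = random.Random(seed)
    n = len(tokens)
    masked = set(rng.sample(range(n), int(n * mask_ratio)))
    view_a = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    view_b = [t if i in masked else MASK_ID for i, t in enumerate(tokens)]
    return view_a, view_b


a, b = complementary_masking([10, 11, 12, 13])
# every position is masked in exactly one of the two views
assert all((x == MASK_ID) != (y == MASK_ID) for x, y in zip(a, b))
```

Training on both views means the masked-token loss covers the full sequence each step, rather than only the randomly masked subset.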

