
LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

February 15, 2026
作者: Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, Jason Kuen
cs.AI

Abstract

Diffusion language models (dLLMs) have recently emerged as a promising alternative to autoregressive LLMs, and the latest works have further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reasoning-intensive grounding, and image editing.
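The abstract names complementary likelihood estimation without defining it. One plausible reading, sketched below purely as an illustration and not as the paper's actual algorithm, is that a masked diffusion LM's sequence likelihood is estimated by scoring each random mask together with its complement, so every token position is covered exactly once per draw. The `token_nll` function here is a hypothetical stand-in for the model's per-token loss.

```python
import random

def token_nll(seq, i, visible):
    # Hypothetical stand-in for a masked diffusion LM's per-token
    # negative log-likelihood at position i given the visible context.
    # A real model would run a forward pass; this toy returns a
    # deterministic value so the estimator's structure is visible.
    return 1.0 / (i + 1)

def complementary_nll(seq, num_pairs=4, seed=0):
    """Estimate sequence NLL by averaging over complementary mask pairs.

    Each draw splits positions into a random masked set M and its
    complement; scoring both sides covers every token exactly once per
    pair, which reduces variance relative to independent random masking
    (one hedged reading of "complementary likelihood estimation").
    """
    rng = random.Random(seed)
    n = len(seq)
    total = 0.0
    for _ in range(num_pairs):
        mask = {i for i in range(n) if rng.random() < 0.5}
        comp = set(range(n)) - mask
        # Score tokens in M conditioned on comp, then vice versa.
        total += sum(token_nll(seq, i, comp) for i in mask)
        total += sum(token_nll(seq, i, mask) for i in comp)
    return total / num_pairs

print(round(complementary_nll([1, 2, 3, 4]), 4))
```

Because the two passes partition the positions, each draw scores the full sequence; with the toy loss above the estimate is exact regardless of which masks are drawn.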