
LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

February 15, 2026
Authors: Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, Jason Kuen
cs.AI

Abstract

Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs, and the latest works have further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised fine-tuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reasoning-intensive grounding, and image editing.
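
The abstract lists answer-forcing among the training techniques without elaborating on it. As one plausible reading, the sketch below clamps a fixed answer template (e.g. "The answer is \boxed{...}") during masked-diffusion denoising, so that every RL rollout terminates in a parseable answer for reward computation. This is a minimal illustration under stated assumptions: the function name `answer_forced_denoise`, the `model(...)` call returning per-token logits, and the confidence-based unmasking schedule are all hypothetical and are not taken from the paper.

```python
import torch

def answer_forced_denoise(model, prompt_ids, answer_template_ids,
                          gen_len, num_steps, mask_id):
    """Hypothetical sketch of answer-forcing for a masked-diffusion LM rollout.

    `answer_template_ids` is a token sequence like "The answer is <mask>",
    where <mask> slots are left for the model to fill. All non-mask template
    positions are clamped at every step so the rollout always ends in a
    parseable answer span.
    """
    device = prompt_ids.device
    # Start from a fully masked generation canvas.
    x = torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)

    # Force the answer template into the tail of the canvas.
    forced = torch.tensor(answer_template_ids, device=device)
    tail = forced.numel()
    keep = forced != mask_id                     # positions fixed by the template
    x[0, gen_len - tail:][keep] = forced[keep]

    for step in range(num_steps):
        still_masked = x == mask_id
        if still_masked.sum() == 0:
            break
        # Assumed interface: model maps token ids to (batch, seq, vocab) logits.
        logits = model(torch.cat([prompt_ids, x], dim=1))[:, prompt_ids.size(1):]
        conf, pred = logits.softmax(-1).max(-1)
        # Unmask the most confident masked positions on a linear schedule.
        k = max(1, still_masked.sum().item() // (num_steps - step))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf[0].topk(k).indices
        x[0, idx] = pred[0, idx]
        # Re-apply the forced template so it is never overwritten.
        x[0, gen_len - tail:][keep] = forced[keep]
    return x
```

In an RL loop, such clamping would guarantee that the reward function can always extract an answer from the rollout, avoiding wasted samples from truncated or answer-free generations; whether LaViDa-R1 implements answer-forcing this way is not specified in the abstract.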