From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
October 22, 2025
Authors: Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo
cs.AI
Abstract
Discrete diffusion models have emerged as a promising direction for
vision-language tasks, offering bidirectional context modeling and theoretical
parallelization. However, their practical application is severely hindered by a
train-inference discrepancy, which leads to catastrophic error cascades:
initial token errors during parallel decoding pollute the generation context,
triggering a chain reaction of compounding errors and leading to syntactic
errors and semantic hallucinations. To address this fundamental challenge, we
reframe the generation process from passive denoising to active refining. We
introduce ReDiff, a refining-enhanced diffusion framework that teaches the
model to identify and correct its own errors. Our approach features a two-stage
training process: first, we instill a foundational revision capability by
training the model to revise synthetic errors; second, we implement a novel
online self-correction loop where the model is explicitly trained to revise its
own flawed drafts by learning from an expert's corrections. This mistake-driven
learning endows the model with the crucial ability to revisit and refine its
already generated output, effectively breaking the error cascade. Extensive
experiments demonstrate that ReDiff significantly improves the coherence and
factual accuracy of generated content, enabling stable and efficient parallel
generation far superior to traditional denoising methods. Our code and models
are available at https://rediff-hku.github.io/.
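The abstract's two-stage recipe (revise synthetic errors, then learn online from expert corrections of the model's own parallel-decoded drafts) can be illustrated with a minimal, hypothetical sketch. The toy model, the `inject_synthetic_errors` and `expert_correct` helpers, and all hyperparameters below are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage refining-style training described in the abstract.
# All names and components here are assumptions for illustration, not ReDiff's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ_LEN = 1000, 0, 16

class ToyDiffusionLM(nn.Module):
    """Stand-in bidirectional denoiser: embeds tokens, predicts per-position logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 64)
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(64, 4, batch_first=True), num_layers=2)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, tokens):
        return self.head(self.enc(self.emb(tokens)))

def inject_synthetic_errors(target, p=0.15):
    """Stage 1: corrupt reference tokens with random replacements the model must revise."""
    noise = torch.randint(1, VOCAB, target.shape)
    return torch.where(torch.rand(target.shape) < p, noise, target)

def expert_correct(draft, target):
    """Stage 2 stand-in for an expert corrector; here it simply returns the reference."""
    return target

model = ToyDiffusionLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def revision_loss(inputs, labels):
    logits = model(inputs)
    return F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1))

for step in range(2):  # tiny loop for illustration
    target = torch.randint(1, VOCAB, (4, SEQ_LEN))

    # Stage 1: foundational revision -- map synthetically corrupted text back to the reference.
    loss_stage1 = revision_loss(inject_synthetic_errors(target), target)

    # Stage 2: online self-correction -- decode a flawed draft in parallel,
    # then train the model to revise that draft toward the expert's correction.
    with torch.no_grad():
        masked = torch.full((4, SEQ_LEN), MASK_ID, dtype=torch.long)
        draft = model(masked).argmax(-1)  # parallel "draft" decode
    loss_stage2 = revision_loss(draft, expert_correct(draft, target))

    (loss_stage1 + loss_stage2).backward()
    opt.step()
    opt.zero_grad()
```

The key difference from standard denoising training is the second loss term: its inputs are the model's own already-decoded (and possibly erroneous) tokens rather than masked noise, so the model is explicitly trained to revisit and fix committed output instead of only filling in masks.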