DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Abstract

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited to problems that require precise continuous outputs, such as localizing the temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression-based and generative approaches.
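As an illustration of the residual idea only, the following minimal sketch shows how a coarse base prediction might be refined by integrating a velocity field over a residual. The abstract does not specify the probability path, the solver, or how the refinement module is conditioned, so everything here is an assumption: the names `refine` and `toy_velocity` are hypothetical, the solver is plain Euler integration, and a closed-form velocity for a straight-line path stands in for a learned network.

```python
import numpy as np

def refine(base_pred, velocity_fn, num_steps=10, rng=None):
    """Integrate a residual from noise (t=0) to t=1 with Euler steps,
    then add it to the coarse base prediction."""
    rng = np.random.default_rng(0) if rng is None else rng
    residual = rng.standard_normal(base_pred.shape)  # r_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1
        residual = residual + dt * velocity_fn(residual, t, base_pred)
    return base_pred + residual

# Toy setup (assumed values): a known target and a coarse estimate of it.
true_target = np.array([2.0, 5.0])
base_pred = np.array([1.8, 5.3])

def toy_velocity(residual, t, base_pred):
    # Stand-in for a learned velocity network. For a straight-line path
    # toward the true residual, the velocity is (r_1 - r_t) / (1 - t).
    true_residual = true_target - base_pred
    return (true_residual - residual) / (1.0 - t)

refined = refine(base_pred, toy_velocity)
print(refined)
```

In this sketch the generative part only has to model the small offset between `base_pred` and the target rather than the target itself, which is the intuition behind modeling a localized residual distribution around a strong prior. A trained model would replace `toy_velocity` with a network conditioned on VLM features.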