DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Abstract

Many modern vision-language models (VLMs) rely on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited to problems that require precise continuous outputs, such as localizing the temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation changes the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, which substantially simplifies optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning multimodal large language models (MLLMs), vision-language-action models (VLAs), and world action models (WAMs), DRIFT outperforms a set of strong regression-based and generative baselines.
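The residual formulation can be illustrated with a minimal sketch. This is not the authors' implementation: the linear interpolation path, the Euler sampler, and every name below (`flow_matching_pair`, `refine`, `oracle_velocity`, the toy numbers) are assumptions chosen for illustration. The abstract states only that a base predictor supplies a coarse estimate and that a flow-matching module refines it iteratively. Here a hand-written oracle velocity field stands in for the learned network so that the example runs on its own.

```python
import numpy as np


def flow_matching_pair(residual, rng):
    """Build one training example for a linear-path flow-matching loss.

    Assumed setup: the path runs from Gaussian noise (t=0) to the
    residual (t=1), so the regression target is a constant velocity.
    """
    noise = rng.standard_normal(residual.shape)
    t = rng.uniform()
    x_t = (1.0 - t) * noise + t * residual
    target_velocity = residual - noise
    return x_t, t, target_velocity


def refine(base_estimate, velocity_fn, num_steps=8, rng=None):
    """Integrate a velocity field from noise to a residual with Euler
    steps, then add the residual to the coarse base estimate."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(base_estimate.shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return base_estimate + x


# Toy example (hypothetical numbers): temporal boundaries in seconds.
target = np.array([2.30, 7.85])   # ground-truth start and end
base = np.array([2.00, 8.00])     # coarse estimate from a base predictor
true_residual = target - base


def oracle_velocity(x, t):
    """Stand-in for a learned network: the exact velocity field when the
    residual distribution is a single point."""
    return (true_residual - x) / (1.0 - t)


refined = refine(base, oracle_velocity)
```

In this toy setting the refined output coincides with the target because the oracle field is exact. With a learned velocity network the refinement would only approximate the residual, and the network would typically also be conditioned on VLM features, which this sketch omits.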