DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Abstract

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited to problems that require precise continuous outputs, such as localizing the temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution into modeling a localized residual distribution around a strong prior, which substantially simplifies optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression-based and generative baselines.
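The residual formulation described above can be illustrated with a minimal scalar sketch. It is not the authors' implementation: the linear noise-to-residual path, the Euler sampler, the noise scale, and the names `flow_matching_pair`, `refine`, and `velocity_fn` are all assumptions made for illustration, and the learned velocity network is replaced by a hand-written oracle so the example runs without training.

```python
import random


def flow_matching_pair(residual, t, noise):
    """Build one training example for residual flow matching (assumed linear path).

    The path interpolates from a noise sample (t=0) to the true residual (t=1),
    where residual = target - base_estimate. The regression target for the
    velocity model is the constant velocity along that straight path.
    """
    r_t = (1.0 - t) * noise + t * residual
    target_velocity = residual - noise
    return r_t, target_velocity


def refine(base_estimate, velocity_fn, num_steps=8, noise_scale=0.1, rng=None):
    """Refine a coarse base estimate by Euler-integrating a velocity field.

    velocity_fn(r, t, cond) is a hypothetical stand-in for a learned network;
    here the conditioning is just the base estimate, whereas a real system
    would presumably condition on VLM features as well.
    """
    rng = rng or random.Random(0)
    r = rng.gauss(0.0, noise_scale)  # start from a small noise sample
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        r += dt * velocity_fn(r, t, base_estimate)
    return base_estimate + r  # final output = coarse estimate + refined residual


# Toy usage: the true event boundary is at 12.3 s and the base predictor says 12.0 s.
true_residual = 12.3 - 12.0


def oracle_velocity(r, t, cond):
    # For a single known residual, the ideal velocity points straight at it.
    return (true_residual - r) / (1.0 - t)


refined = refine(12.0, oracle_velocity)
```

Because the model only has to transport noise onto a small residual near zero rather than onto the full output range, the target distribution is narrow, which is the intuition behind the simplified optimization claimed in the abstract.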