DRIFT: A Residual Flow Adapter for Continuous Output Decoding in Vision-Language Models

Abstract

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited to problems that require precise continuous outputs, such as localizing the temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation changes the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, which substantially simplifies optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms strong regression-based and generative baselines.
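
The residual formulation described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so every name here (`base_predictor`, `make_oracle_velocity`, `refine`) is hypothetical, the output is a single scalar, and the velocity field is a hand-written stand-in for what would be a learned, VLM-conditioned network. The sketch only shows the general pattern of taking a coarse base estimate and adding a residual obtained by Euler-integrating a flow from noise at t = 0 to t = 1.

```python
import random


def base_predictor(features):
    """Hypothetical coarse regressor: returns a rough estimate of the target."""
    return sum(features) / len(features)


def make_oracle_velocity(target_residual):
    """Stand-in for a learned velocity network (assumption, not from the paper).

    For the linear path x_t = (1 - t) * x_0 + t * r, the velocity that carries
    any point x_t to a fixed residual r at t = 1 is (r - x_t) / (1 - t).
    """
    def velocity(x, t):
        return (target_residual - x) / (1.0 - t)
    return velocity


def refine(base_estimate, velocity, num_steps=8, seed=0):
    """Integrate the residual from noise (t = 0) to t = 1 with Euler steps,
    then add the resulting residual to the base estimate."""
    rng = random.Random(seed)
    residual = rng.gauss(0.0, 1.0)  # start from a Gaussian noise sample
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        residual += dt * velocity(residual, t)
    return base_estimate + residual


features = [2.0, 4.0, 6.0]          # toy input features
true_target = 4.3                   # toy ground-truth continuous output
coarse = base_predictor(features)   # coarse estimate from the base predictor
velocity = make_oracle_velocity(true_target - coarse)
refined = refine(coarse, velocity)  # base estimate plus integrated residual
```

In this toy setting the residual is a single fixed value, so the oracle field drives any noise sample onto it. In a realistic setting the velocity would be trained with a flow-matching loss over the distribution of residuals, which is the part the abstract argues is easier to learn than the global output distribution.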