

ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

December 2, 2025
Authors: Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu
cs.AI

Abstract

Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning over latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg performs three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- each aligned with the pretrained capabilities of the VLM. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. The project page is available at https://clementine24.github.io/ReVSeg/.
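To make the three-operation decomposition concrete, the sketch below illustrates how such a sequential reasoning chain might look in code. It is a minimal illustration under stated assumptions, not the authors' implementation: the `vlm` interface and all method names (`interpret_semantics`, `select_temporal_evidence`, `ground_spatially`) are hypothetical stand-ins, and the outcome reward is shown as a simple mean mask IoU, one plausible choice of outcome-driven signal for the RL stage.

```python
# Hypothetical sketch of ReVSeg's explicit three-step reasoning chain, as
# described in the abstract. The `vlm` object and its methods are illustrative
# placeholders, not the authors' actual API.

def reasoning_chain(vlm, frames, query):
    """Run the three explicit operations as sequential decisions."""
    # 1. Semantics interpretation: resolve what the query refers to
    #    (dynamics, causality, interactions) into an explicit target description.
    target = vlm.interpret_semantics(query, frames)

    # 2. Temporal evidence selection: choose the frames that actually
    #    carry evidence for the described target behavior.
    key_frames = vlm.select_temporal_evidence(target, frames)

    # 3. Spatial grounding: localize the target in each selected frame,
    #    yielding per-frame segmentation masks.
    return {i: vlm.ground_spatially(target, frames[i]) for i in key_frames}


def outcome_reward(pred_masks, gt_masks):
    """One plausible outcome-driven signal: mean mask IoU over selected frames.

    In an RL setup (e.g., a policy-gradient method), this scalar would score
    a complete rollout of the reasoning chain above. Masks are assumed to be
    boolean arrays indexed by frame id.
    """
    ious = []
    for i, pred in pred_masks.items():
        inter = (pred & gt_masks[i]).sum()
        union = (pred | gt_masks[i]).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return sum(ious) / len(ious) if ious else 0.0
```

Because only the final outcome is scored, the intermediate decisions (which frames to select, how to describe the target) are left for the policy to refine on its own, which matches the abstract's framing of self-refined decision quality.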