Watch Before You Answer: Learning from Visually Grounded Post-Training

April 6, 2026
Authors: Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen
cs.AI

Abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even more limited than previously assumed: in commonly used long video understanding benchmarks, 40-60% of questions can be answered using text cues alone. Furthermore, we find that this issue is also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding. Guided by this observation, we introduce VidGround, a simple yet effective solution: post-training only on questions that are genuinely visually grounded and free of linguistic bias. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation combined with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
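The abstract implies a simple curation recipe: query a text-only ("blind") model on each question without the video, and keep only the examples it fails, since those are the ones that genuinely require visual grounding. Below is a minimal sketch of that filtering step; the `QAExample` schema, the `blind_answer` callable, and the resampling count are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of VidGround-style data curation: keep a QA example for
# post-training only if a text-only model never answers it correctly.
# All names and thresholds here are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAExample:
    question: str
    options: List[str]
    answer: str  # ground-truth option label, e.g. "B"


def curate_visually_grounded(
    dataset: List[QAExample],
    blind_answer: Callable[[QAExample], str],  # text-only model call, no video input
    n_samples: int = 4,                        # resample to discount lucky guesses
) -> List[QAExample]:
    kept = []
    for ex in dataset:
        # If the blind model ever answers correctly, the question leaks
        # enough textual signal to be solvable without watching the video.
        if not any(blind_answer(ex) == ex.answer for _ in range(n_samples)):
            kept.append(ex)
    return kept
```

In practice, `blind_answer` would wrap an actual LLM call that sees only the question and options; sampling it several times guards against chance-level correct guesses on multiple-choice questions, at the cost of a stricter (smaller) curated set.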