PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
January 21, 2026
Authors: Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu
cs.AI
Abstract
Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated ProgressLM-45K dataset. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.
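For illustration, the sketch below shows one way a two-stage, describe-then-estimate progress-reasoning prompt could be structured in Python. The `query_vlm` helper, the prompt wording, and the percentage/"unanswerable" output format are assumptions made for this sketch; they are not the prompts or interfaces specified in the paper.

```python
# Minimal sketch of a two-stage progress-reasoning prompt (assumed, not the
# paper's exact method). `query_vlm(images, prompt) -> str` is a hypothetical
# wrapper around whatever VLM backend is being evaluated.

def estimate_progress(demo_frames, current_frame, query_vlm):
    # Stage 1: summarize the demonstrated task and its key milestones.
    task_summary = query_vlm(
        demo_frames,
        "Describe the task shown in these demonstration frames and list "
        "its key milestones in order.",
    )

    # Stage 2: judge how far the current partial observation is along that
    # task, explicitly allowing an "unanswerable" outcome.
    answer = query_vlm(
        [current_frame],
        f"Task and milestones:\n{task_summary}\n\n"
        "Given the current observation, estimate task progress as a "
        "percentage between 0 and 100, or reply 'unanswerable' if the "
        "observation is insufficient to tell.",
    )
    return answer
```

This mirrors the human-inspired two-stage paradigm described above (first build a model of the task, then localize the observation within it), and the explicit "unanswerable" option reflects the benchmark's emphasis on handling unanswerable cases.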