PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
January 21, 2026
Authors: Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu
cs.AI
Abstract
Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated ProgressLM-45K dataset. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.
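The abstract does not spell out the prompt format, but the two-stage paradigm it describes can be illustrated with a minimal sketch: stage one asks the model to recover the task goal and its ordered sub-steps from the demonstration, and stage two asks it to locate the current observation among those sub-steps and report a progress estimate, or declare the case unanswerable. The OpenAI client, model name, and prompt wording below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical two-stage progress-reasoning prompt for a generic VLM API.
# Prompt wording, model choice, and helper names are assumptions for
# illustration only; they are not taken from the paper.
from openai import OpenAI

client = OpenAI()

STAGE1 = (
    "Stage 1: From the demonstration frames, state the task goal and list "
    "the sub-steps needed to complete it, in order."
)
STAGE2 = (
    "Stage 2: Given the current observation, identify which sub-steps are "
    "already complete and report overall progress as a percentage. "
    "If progress cannot be determined, answer 'unanswerable'."
)

def estimate_progress(demo_frame_urls, observation_url, model="gpt-4o-mini"):
    """Query a VLM with a structured two-stage progress-reasoning prompt."""
    content = [{"type": "text", "text": STAGE1}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in demo_frame_urls]
    content += [
        {"type": "text", "text": STAGE2},
        {"type": "image_url", "image_url": {"url": observation_url}},
    ]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```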