

Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

October 30, 2025
作者: Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Hu Chen, Jiakang Liu, JG Yao, Xi Yang
cs.AI

Abstract

Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs), as we find in a preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measuring instruments, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation of popular proprietary and open-weight VLMs shows that even the strongest frontier models struggle with measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignment marks, leading to large numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on the in-domain synthetic subset but less promising results on real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can support future advances in visually grounded numeracy and the precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
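
To make the idea of procedural gauge synthesis concrete, below is a minimal sketch (not the authors' pipeline) of how an analog dial with a known ground-truth value could be rendered, so that image/label pairs can be generated at scale; function and parameter names such as `draw_gauge`, `min_value`, and `start_deg` are illustrative assumptions, and controllable factors like fonts, lighting, and clutter are omitted here.

```python
# Minimal, hypothetical sketch of procedural gauge synthesis:
# draw a circular dial with tick marks and a pointer whose angle
# linearly encodes a known ground-truth value.
import math
import random
from PIL import Image, ImageDraw

def draw_gauge(value, min_value=0.0, max_value=100.0, size=256,
               start_deg=225.0, end_deg=-45.0, n_ticks=11, seed=None):
    """Render a simple analog dial; return (image, ground_truth_value)."""
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), (rng.randint(220, 255),) * 3)
    draw = ImageDraw.Draw(img)
    cx = cy = size / 2
    r = size * 0.42

    # Dial face.
    draw.ellipse([cx - r, cy - r, cx + r, cy + r], outline=(0, 0, 0), width=3)

    # Tick marks evenly spaced over the sweep from start_deg to end_deg.
    for i in range(n_ticks):
        frac = i / (n_ticks - 1)
        ang = math.radians(start_deg + frac * (end_deg - start_deg))
        x0 = cx + (r - 12) * math.cos(ang)
        y0 = cy - (r - 12) * math.sin(ang)  # flip y: image coords grow downward
        draw.line([x0, y0, cx + r * math.cos(ang), cy - r * math.sin(ang)],
                  fill=(0, 0, 0), width=2)

    # Pointer angle is a linear map of the target value onto the sweep.
    frac = (value - min_value) / (max_value - min_value)
    ang = math.radians(start_deg + frac * (end_deg - start_deg))
    draw.line([cx, cy,
               cx + (r - 20) * math.cos(ang),
               cy - (r - 20) * math.sin(ang)],
              fill=(200, 0, 0), width=4)
    return img, value

if __name__ == "__main__":
    image, label = draw_gauge(value=37.5, seed=0)
    image.save("gauge_37_5.png")  # a VLM would then be asked to read back `label`
```

Because the pointer angle is computed from the target value, the exact reading is known for every synthesized image, which is what makes large-scale evaluation and the reinforcement-learning experiments on synthetic data possible.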