Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
October 30, 2025
Authors: Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Hu Chen, Jiakang Liu, JG Yao, Xi Yang
cs.AI
Abstract
Reading measurement instruments is effortless for humans and requires
relatively little domain expertise, yet it remains surprisingly challenging for
current vision-language models (VLMs), as we find in our preliminary evaluation. In
this work, we introduce MeasureBench, a benchmark on visual measurement reading
covering both real-world and synthesized images of various types of
measurements, along with an extensible pipeline for data synthesis. Our
pipeline procedurally generates a specified type of gauge with controllable
visual appearance, enabling scalable variation in key details such as pointers,
scales, fonts, lighting, and clutter. Evaluation on popular proprietary and
open-weight VLMs shows that even the strongest frontier VLMs struggle with
measurement reading in general. A consistent failure mode is indicator
localization: models can read digits or labels but misidentify the key
positions of pointers or alignment marks, leading to large numeric errors despite
plausible textual reasoning. We have also conducted preliminary experiments
with reinforcement learning over synthetic data, and find encouraging results
on the in-domain synthetic subset but less promising generalization to real-world images. Our
analysis highlights a fundamental limitation of current VLMs in fine-grained
spatial grounding. We hope this resource can support future advances in visually
grounded numeracy and precise spatial perception of VLMs, bridging the gap
between recognizing numbers and measuring the world.
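
The abstract describes a synthesis pipeline that procedurally generates gauges with controllable pointers, scales, fonts, lighting, and clutter. Below is a minimal, hypothetical sketch of what one such generator for a circular analog dial might look like, written with Pillow; `GaugeSpec` and `render_gauge` are illustrative names only and do not correspond to the authors' released code. The key property illustrated is that the ground-truth reading is known by construction, which is what makes procedurally generated images usable for evaluation or reinforcement learning.

```python
# Minimal sketch (not the authors' pipeline): render a circular dial with a
# controllable scale range, tick count, and pointer value, returning the image
# whose ground-truth reading is known by construction. A full pipeline would
# also randomize fonts, lighting, occlusion, and background clutter.
import math
import random
from dataclasses import dataclass
from PIL import Image, ImageDraw

@dataclass
class GaugeSpec:
    size: int = 256           # image side length in pixels
    min_value: float = 0.0    # value at the first major tick
    max_value: float = 100.0  # value at the last major tick
    num_ticks: int = 11       # number of major ticks
    start_angle: float = 225  # dial sweep in degrees (math convention, y up)
    end_angle: float = -45

def render_gauge(spec: GaugeSpec, value: float) -> Image.Image:
    img = Image.new("RGB", (spec.size, spec.size), "white")
    draw = ImageDraw.Draw(img)
    cx = cy = spec.size / 2
    radius = spec.size * 0.45
    # Dial face.
    draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius],
                 outline="black", width=2)
    # Major ticks and numeric labels along the sweep.
    for i in range(spec.num_ticks):
        frac = i / (spec.num_ticks - 1)
        ang = math.radians(spec.start_angle + frac * (spec.end_angle - spec.start_angle))
        draw.line([cx + 0.85 * radius * math.cos(ang), cy - 0.85 * radius * math.sin(ang),
                   cx + radius * math.cos(ang), cy - radius * math.sin(ang)],
                  fill="black", width=2)
        label = f"{spec.min_value + frac * (spec.max_value - spec.min_value):g}"
        draw.text((cx + 0.68 * radius * math.cos(ang), cy - 0.68 * radius * math.sin(ang)),
                  label, fill="black")
    # Pointer at the requested value: the reading is the supervision label.
    frac = (value - spec.min_value) / (spec.max_value - spec.min_value)
    ang = math.radians(spec.start_angle + frac * (spec.end_angle - spec.start_angle))
    draw.line([cx, cy,
               cx + 0.75 * radius * math.cos(ang), cy - 0.75 * radius * math.sin(ang)],
              fill="red", width=3)
    return img

if __name__ == "__main__":
    spec = GaugeSpec()
    value = random.uniform(spec.min_value, spec.max_value)
    render_gauge(spec, value).save("gauge.png")
    print(f"ground-truth reading: {value:.1f}")
```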