MolmoPoint: Better Pointing for VLMs with Grounding Tokens
March 30, 2026
Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna
cs.AI
Abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens containing the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make pointing more fine-grained, we follow this pointing token with an additional special token that selects a subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state of the art on image pointing (70.7% on PointBench), set a new state of the art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human-preference win rate vs. a text-coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency, and we discuss the qualitative differences that emerge from this design change.
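The core selection step described above can be sketched as dot-product scoring of a pointing query against every visual token plus one extra "no-more-points" class. This is a minimal illustrative sketch, not the paper's implementation: the function name `select_visual_token`, the toy embeddings, and the use of a plain dot product as the cross-attention score are all assumptions for clarity.

```python
import numpy as np

def select_visual_token(point_query, visual_tokens, no_point_embed):
    """Score every visual token plus a special 'no-more-points' class
    against the pointing query; return the winning index (None when the
    stop class wins) and the softmax distribution over all candidates."""
    # candidates: [num_tokens + 1, d]; the final row is the stop class
    candidates = np.vstack([visual_tokens, no_point_embed])
    scores = candidates @ point_query            # attention-style logits
    probs = np.exp(scores - scores.max())        # numerically stable softmax
    probs /= probs.sum()
    idx = int(np.argmax(probs))
    return (None if idx == len(visual_tokens) else idx), probs

# Toy usage: 4 visual tokens; the query aligns with token 2,
# so that token outscores both the other tokens and the stop class.
d = 8
visual = np.zeros((4, d))
visual[2] = 1.0                                  # token 2 matches the query
query = np.ones(d)
stop = np.full(d, -1.0)                          # hypothetical stop embedding
idx, probs = select_visual_token(query, visual, stop)
# idx → 2 (token 2 is selected, not the no-more-points class)
```

In the full model, the chosen coarse token would then be refined by two further selection stages (subpatch, then location within the subpatch), and generation would stop when the no-more-points class wins the argmax.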