PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

Abstract

Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena for blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system that lets users directly evaluate the pointing capabilities of multimodal models in practical settings. We conducted extensive evaluations of state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. We also observe strong correlations across our multi-stage evaluation pipeline, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: https://pointarena.github.io/

