Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Abstract

Sports have long attracted broad attention because they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning across representative net sports, including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine that reconstructs sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in the spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatially aware commentary generation. Together, these results demonstrate that CourtSI provides a scalable pathway toward advancing the spatial intelligence of VLMs in sports.
