
Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

March 10, 2026
Authors: Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong
cs.AI

Abstract

Sports have long attracted broad attention because they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning across representative net sports, including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence that existing benchmarks fail to capture. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and generates better spatially aware commentary. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing the spatial intelligence of VLMs in sports.
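To illustrate what "court geometry as metric anchors" can mean in practice, the sketch below fits a plane homography from four court corners in an image to their known positions in metres, then measures a ground-plane distance. This is not the authors' data engine, which the abstract does not describe; the pixel coordinates and the function names (`fit_homography`, `to_court`, `court_distance`) are hypothetical. It assumes a standard badminton doubles court of 13.4 m by 6.1 m and applies only to points lying on the court plane, such as players' feet.

```python
import math

# Illustrative assumption: pixel positions of the four outer court corners
# (far-left, far-right, near-right, near-left) in one broadcast frame.
PIXEL_CORNERS = [(300.0, 200.0), (980.0, 200.0), (1180.0, 650.0), (100.0, 650.0)]
# The same corners in metres on a standard badminton doubles court (6.1 m x 13.4 m).
COURT_CORNERS_M = [(0.0, 13.4), (6.1, 13.4), (6.1, 0.0), (0.0, 0.0)]


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_homography(pixels, metres):
    """Return 8 homography coefficients (h33 fixed to 1) mapping pixels to metres."""
    a, b = [], []
    for (u, v), (x, y) in zip(pixels, metres):
        a.append([u, v, 1.0, 0.0, 0.0, 0.0, -u * x, -v * x])
        b.append(x)
        a.append([0.0, 0.0, 0.0, u, v, 1.0, -u * y, -v * y])
        b.append(y)
    return solve_linear(a, b)


def to_court(h, u, v):
    """Map a pixel on the court plane to court coordinates in metres."""
    w = h[6] * u + h[7] * v + 1.0
    return ((h[0] * u + h[1] * v + h[2]) / w, (h[3] * u + h[4] * v + h[5]) / w)


def court_distance(h, pixel_a, pixel_b):
    """Metric distance between two ground-plane points given in pixels."""
    xa, ya = to_court(h, *pixel_a)
    xb, yb = to_court(h, *pixel_b)
    return math.hypot(xa - xb, ya - yb)


h = fit_homography(PIXEL_CORNERS, COURT_CORNERS_M)
# Distance between two hypothetical foot positions picked in the image.
print(round(court_distance(h, (500.0, 300.0), (700.0, 550.0)), 2))
```

Four correspondences determine the homography exactly, so any error in the picked corners propagates directly into the measured distance. Objects above the court plane, such as a shuttlecock in flight, would need a full 3D reconstruction.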