Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Abstract

Sports have long attracted broad attention because they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning across representative net sports, including badminton, tennis, and table tennis. Using well-defined court geometry as metric anchors, we develop a semi-automatic data engine that reconstructs sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench and find a persistent human-AI performance gap, as well as limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence that existing benchmarks fail to capture. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and generates better spatially aware commentary. Together, these results show that CourtSI provides a scalable pathway toward advancing the spatial intelligence of VLMs in sports.
