Toward the Court: Benchmarking Spatial Intelligence in Sports

Abstract

Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These results indicate that sports scenarios expose limitations in the spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatially aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing the spatial intelligence of VLMs in sports.
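
To make the idea of court geometry as a metric anchor concrete, the sketch below fits a planar homography from four court corners in an image to their known positions in meters, then measures a ground-plane distance. It is an illustration only and is not the CourtSI data engine, whose details the abstract does not give. The pixel coordinates and the helper names (fit_homography, to_court, ground_distance) are hypothetical; the court size is the standard badminton doubles court (6.10 m by 13.40 m). The mapping is valid only for points lying on the court plane, such as players' feet.

```python
import numpy as np

# Standard badminton doubles court in meters (width x length).
COURT_W, COURT_L = 6.10, 13.40

# Court corners in metric court coordinates (origin at one corner).
COURT_CORNERS = np.array([[0.0, 0.0], [COURT_W, 0.0],
                          [COURT_W, COURT_L], [0.0, COURT_L]])

# Hypothetical pixel locations of the same corners in a broadcast frame.
PIXEL_CORNERS = np.array([[400.0, 300.0], [880.0, 300.0],
                          [1100.0, 700.0], [180.0, 700.0]])


def fit_homography(src, dst):
    """Direct linear transform: find 3x3 H with dst ~ H @ src (>= 4 pairs)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector with the smallest
    # singular value of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def to_court(h, pixels):
    """Map pixel points on the court plane to metric court coordinates."""
    pixels = np.atleast_2d(np.asarray(pixels, dtype=float))
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]


def ground_distance(h, pixel_a, pixel_b):
    """Distance in meters between two image points on the court plane."""
    a, b = to_court(h, [pixel_a, pixel_b])
    return float(np.linalg.norm(a - b))


if __name__ == "__main__":
    H = fit_homography(PIXEL_CORNERS, COURT_CORNERS)
    # Hypothetical foot positions of two players, in pixels.
    print(ground_distance(H, (520.0, 380.0), (760.0, 620.0)))
```

Because the four corners determine the homography exactly, mapping the pixel corners back through H should recover the metric corners up to floating-point error, which is a quick sanity check before measuring anything else. Objects above the court plane, such as a shuttlecock in flight, would need additional 3D reasoning that this sketch does not attempt.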