Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Abstract

Sports have long attracted broad attention because they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatially aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing the spatial intelligence of VLMs in sports.
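The idea of using court geometry as a metric anchor can be illustrated with a minimal sketch. It is not the CourtSI data engine, whose details the abstract does not give. It only shows how four court corners with known real-world positions let one convert pixel locations on the ground plane into metres. The pixel coordinates in `IMAGE_CORNERS` and the helper names `fit_homography`, `to_court`, and `ground_distance` are hypothetical, chosen for illustration. The court size is the standard doubles badminton court (13.40 m by 6.10 m).

```python
import numpy as np

# Standard doubles badminton court size in metres (length x width).
COURT_LENGTH_M = 13.40
COURT_WIDTH_M = 6.10

# Hypothetical pixel positions of the four court corners in one frame
# (near-left, near-right, far-right, far-left). Illustrative values only.
IMAGE_CORNERS = [(300.0, 900.0), (1620.0, 900.0), (1250.0, 350.0), (670.0, 350.0)]

# The same corners in court-plane coordinates (metres).
COURT_CORNERS = [
    (0.0, 0.0),
    (COURT_WIDTH_M, 0.0),
    (COURT_WIDTH_M, COURT_LENGTH_M),
    (0.0, COURT_LENGTH_M),
]


def fit_homography(image_pts, court_pts):
    """Estimate a 3x3 homography from pixels to court metres (direct linear transform)."""
    rows = []
    for (u, v), (x, y) in zip(image_pts, court_pts):
        # Each correspondence gives two linear constraints on the 9 entries of H.
        rows.append([-u, -v, -1.0, 0.0, 0.0, 0.0, x * u, x * v, x])
        rows.append([0.0, 0.0, 0.0, -u, -v, -1.0, y * u, y * v, y])
    a = np.asarray(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(a)
    h = vt[-1].reshape(3, 3)
    return h / np.linalg.norm(h)


def to_court(h, pixel):
    """Map one pixel on the ground plane to court coordinates in metres."""
    p = h @ np.array([pixel[0], pixel[1], 1.0])
    return p[:2] / p[2]


def ground_distance(h, pixel_a, pixel_b):
    """Metric distance between two ground-plane points given in pixels."""
    return float(np.linalg.norm(to_court(h, pixel_a) - to_court(h, pixel_b)))


if __name__ == "__main__":
    H = fit_homography(IMAGE_CORNERS, COURT_CORNERS)
    # Distance between two hypothetical foot positions, in metres.
    print(ground_distance(H, (800.0, 700.0), (1100.0, 450.0)))
```

A plane homography of this kind is valid only for points that lie on the court surface, such as players' feet. Locating an airborne ball or shuttlecock requires additional 3D reasoning that this sketch does not attempt.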
