TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos
May 2, 2026
Authors: Nima Rahmanian, Daniel Kienzle, Thomas Gossard, Dvij Kalaria, Rainer Lienhart, Shankar Sastry
cs.AI
Abstract
We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides more than 140 hours of singles and doubles gameplay reconstructed from monocular broadcast videos, with multimodal annotations that include high-quality camera calibrations, precise 3D ball positions, ball spin, time segmentation, and 3D human meshes over time. This rich data provides a new foundation for virtual replay, in-depth player analysis, and robot learning. The dataset's combination of scale and precision is achieved through a novel reconstruction pipeline. Prior methods first partition a game sequence into individual shot segments based on the 2D ball track, and only then attempt reconstruction. However, 2D-based time segmentation breaks down under occlusion and varied camera viewpoints, which prevents reliable reconstruction. We invert this paradigm: a learned lifting network first lifts the entire unsegmented 2D ball track to 3D, and the resulting 3D trajectory then allows reliable time segmentation. The lifting network also infers the ball's spin, handles unreliable ball detections, and successfully reconstructs the ball trajectory under heavy occlusion. This lift-first design is essential: our pipeline is currently the only method capable of reconstructing table tennis gameplay from general-view monocular broadcast videos. We demonstrate the dataset's fidelity through two downstream tasks: estimating the racket's pose and velocity at impact, and training a generative model of competitive rallies.
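To illustrate why time segmentation becomes simpler once a 3D trajectory is available, the following is a minimal sketch of a hand-written heuristic. It is not the authors' method (the abstract does not specify the segmentation rule); it only shows the kind of cue that 3D positions expose and a 2D image track does not. The coordinate convention (x along the table's length, z up, table surface at z = 0), the function names, and the thresholds are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical convention: (x, y, z) in metres, x along the table, z up.
Point3D = Tuple[float, float, float]


def find_hits(traj: Sequence[Point3D], min_step: float = 1e-3) -> List[int]:
    """Return frame indices where the ball reverses direction along x.

    A reversal along the table's long axis is treated as a racket hit.
    """
    hits: List[int] = []
    prev_sign = 0
    for i in range(1, len(traj)):
        dx = traj[i][0] - traj[i - 1][0]
        if abs(dx) < min_step:
            continue  # skip near-stationary frames to avoid spurious flips
        sign = 1 if dx > 0 else -1
        if prev_sign != 0 and sign != prev_sign:
            hits.append(i - 1)  # turning point: last frame before the reversal
        prev_sign = sign
    return hits


def find_bounces(
    traj: Sequence[Point3D], table_z: float = 0.0, tol: float = 0.05
) -> List[int]:
    """Return frame indices of local height minima close to the table surface."""
    bounces: List[int] = []
    for i in range(1, len(traj) - 1):
        z_prev, z, z_next = traj[i - 1][2], traj[i][2], traj[i + 1][2]
        if z < z_prev and z <= z_next and abs(z - table_z) <= tol:
            bounces.append(i)
    return bounces


def split_into_shots(traj: Sequence[Point3D]) -> List[Tuple[int, int]]:
    """Split a trajectory into (start, end) frame ranges, one per shot."""
    bounds = [0] + find_hits(traj) + [len(traj) - 1]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]
```

On a synthetic rally in which the ball travels out, bounces once, is returned, and bounces again, `split_into_shots` yields one frame range per shot and `find_bounces` marks the two table contacts. Real reconstructions are noisy, so a practical version would smooth the trajectory or fit a physical model before applying rules of this kind.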