TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos

Abstract

We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides more than 140 hours of singles and doubles gameplay reconstructed from monocular broadcast videos, with multimodal annotations that include high-quality camera calibrations, precise 3D ball positions, ball spin, temporal segmentation, and 3D human meshes over time. This data provides a new foundation for virtual replay, in-depth player analysis, and robot learning. The dataset's combination of scale and precision is made possible by a novel reconstruction pipeline. Prior methods first partition a game sequence into individual shot segments based on the 2D ball track, and only then attempt reconstruction. However, 2D-based temporal segmentation breaks down under occlusion and varied camera viewpoints, which prevents reliable reconstruction. We invert this paradigm: a learned lifting network first lifts the entire unsegmented 2D ball track to 3D, and the resulting 3D trajectory is then used to perform temporal segmentation reliably. The lifting network also infers the ball's spin, handles unreliable ball detections, and successfully reconstructs the ball trajectory under heavy occlusion. This lift-first design is necessary, as our pipeline is the only method capable of reconstructing table tennis gameplay from general-view monocular broadcast videos. We demonstrate the dataset's fidelity through two downstream tasks: estimating the racket's pose and velocity at impact, and training a generative model of competitive rallies.
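
As a rough illustration of why segmentation becomes easier once the track is in 3D, the toy sketch below splits an already-lifted trajectory into shots wherever the ball reverses direction along the table's long axis. It is not the authors' pipeline: the function name `split_into_shots`, the coordinate convention (x along the table length), and the direction-reversal rule are all assumptions made for this example, and the learned lifting network itself is not modeled.

```python
from typing import List, Tuple

# (x, y, z) in table coordinates; x is assumed to run along the table's long axis.
Point3D = Tuple[float, float, float]


def split_into_shots(track: List[Point3D], min_step: float = 1e-6) -> List[Tuple[int, int]]:
    """Split a 3D ball track into shot segments at reversals of the x-direction.

    Returns half-open index ranges (start, end) into `track`. The sample at
    which the ball turns around is treated as the first sample of the new shot.
    """
    if not track:
        return []
    if len(track) < 2:
        return [(0, len(track))]

    boundaries = [0]
    prev_sign = 0
    for i in range(1, len(track)):
        dx = track[i][0] - track[i - 1][0]
        if abs(dx) < min_step:
            continue  # skip frames with negligible motion along x
        sign = 1 if dx > 0 else -1
        if prev_sign != 0 and sign != prev_sign:
            boundaries.append(i - 1)  # direction flipped at sample i - 1
        prev_sign = sign
    boundaries.append(len(track))
    return [(boundaries[k], boundaries[k + 1]) for k in range(len(boundaries) - 1)]


if __name__ == "__main__":
    # Hypothetical track: the ball travels out, comes back, and goes out again.
    demo = [
        (0.0, 0.0, 0.3), (1.0, 0.0, 0.2), (2.0, 0.0, 0.1),
        (3.0, 0.0, 0.2), (2.0, 0.0, 0.3), (1.0, 0.0, 0.1),
        (0.0, 0.0, 0.2), (1.0, 0.0, 0.3), (2.0, 0.0, 0.2),
    ]
    print(split_into_shots(demo))  # [(0, 3), (3, 6), (6, 9)]
```

In a 2D image track the same reversal can be hidden by perspective or by a player occluding the ball, which is the failure mode the abstract attributes to segment-first approaches. In metric 3D coordinates the direction of travel along the table is directly observable.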
