The Surprising Effectiveness of Video Diffusion Models for Hand Motion Reconstruction

Abstract

4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods. Image-based pipelines depend on a detector that fails under heavy occlusion. Video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a signal too narrow to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt the model via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames, with no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io
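To make the data flow described above concrete, the following is a minimal shape-level sketch, not the ViDiHand implementation. It only illustrates the two stages named in the abstract: features computed from full frames, followed by a decoder that emits one pose vector per frame and per hand. Everything else is an assumption made for illustration: the function names, the feature width, the 21-joint pose layout, and the random linear maps that stand in for the adapted video diffusion backbone and the learned decoder.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
T, H, W = 8, 64, 64        # clip length and frame resolution
FEAT_DIM = 32              # hypothetical per-frame feature width
POSE_DIM = 3 + 21 * 3      # hypothetical layout: root translation + 21 joints (xyz, meters)


def extract_video_features(frames, proj):
    """Stand-in for the adapted video backbone (hypothetical).

    frames: (T, H, W, 3) full frames, no hand detector or cropping.
    Returns per-frame features of shape (T, FEAT_DIM).
    """
    per_frame = frames.reshape(frames.shape[0], -1, 3).mean(axis=1)  # (T, 3)
    feats = per_frame @ proj                                         # (T, FEAT_DIM)
    # Simple temporal smoothing so each frame's feature also sees its neighbors.
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")            # (T + 2, FEAT_DIM)
    return 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]


def decode_two_hand_pose(feats, w_dec):
    """Stand-in for the pose decoder (hypothetical linear map).

    Returns an array of shape (T, 2, POSE_DIM): one pose vector per frame and hand.
    """
    out = feats @ w_dec                                              # (T, 2 * POSE_DIM)
    return out.reshape(feats.shape[0], 2, POSE_DIM)


rng = np.random.default_rng(0)
frames = rng.random((T, H, W, 3))                       # placeholder video clip
proj = rng.standard_normal((3, FEAT_DIM))               # placeholder backbone weights
w_dec = 0.01 * rng.standard_normal((FEAT_DIM, 2 * POSE_DIM))  # placeholder decoder weights

poses = decode_two_hand_pose(extract_video_features(frames, proj), w_dec)
print(poses.shape)
```

The point of the sketch is the interface: a whole clip goes in, and a dense sequence of two-hand poses comes out in a single forward pass, with no per-frame detection step and no optimization loop at inference time.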