SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

Abstract

Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches rely heavily on predefined 3D object models and lab-captured motion data, which limits their ability to generalize. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often at the expense of physical plausibility. Recognizing that visual appearance and motion patterns are governed by the same fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate HOI video and motion simultaneously. To integrate heterogeneous semantic, appearance, and motion features, our method applies tri-modal adaptive modulation for feature alignment, coupled with 3D full attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back into the synchronized diffusion process to form a closed feedback loop. This architecture eliminates the dependence on predefined object models and explicit pose guidance while significantly improving video-motion consistency. Experimental results show that our method outperforms state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, and that it generalizes notably well to unseen real-world scenarios. The project page is available at https://github.com/Droliven/SViMo_project.

