Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos

Abstract

We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Unlike prior work that relies on marker-based tracking and multiview cameras, ATS learns the natural behaviors of animal and human agents non-invasively from video observations recorded over a long time span (e.g., a month) in a single environment. Modeling the 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired perception and motion data of the agent, queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator. We demonstrate results on pets (e.g., cat, dog, bunny) and humans, given monocular RGBD videos captured by a smartphone.
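
As a rough illustration of what "paired data of perception and motion" might look like once persistent world-frame tracks are available, the sketch below slices two aligned trajectories into training pairs. It is not the ATS implementation: the function `make_pairs`, the inputs `agent_xyz` and `observer_xyz`, the window sizes, and the choice to express everything relative to the agent's current position are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical data layout (an assumption, not the paper's representation):
# one 3D position per frame, already registered to a shared world frame.
Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def make_pairs(
    agent_xyz: List[Vec3],
    observer_xyz: List[Vec3],
    past: int = 2,
    future: int = 2,
) -> List[Tuple[List[Vec3], List[Vec3]]]:
    """Slice aligned world-frame tracks into (perception, motion) pairs.

    perception: the agent's last `past` positions plus the observer's current
                position, all expressed relative to the agent at time t.
    motion:     the agent's next `future` positions relative to time t.
    """
    if len(agent_xyz) != len(observer_xyz):
        raise ValueError("tracks must have one entry per frame")
    pairs = []
    # Only keep time steps that have a full past and a full future window.
    for t in range(past, len(agent_xyz) - future):
        origin = agent_xyz[t]
        perception = [sub(p, origin) for p in agent_xyz[t - past:t]]
        perception.append(sub(observer_xyz[t], origin))
        motion = [sub(p, origin) for p in agent_xyz[t + 1:t + 1 + future]]
        pairs.append((perception, motion))
    return pairs


# Toy tracks: the agent walks along +x while the observer stands still.
agent = [(float(i), 0.0, 0.0) for i in range(6)]
observer = [(10.0, 0.0, 0.0)] * 6
pairs = make_pairs(agent, observer)
```

Each pair could then serve as one conditioning/target example for a generative behavior model. Building such pairs presupposes the persistent tracking described in the abstract, since the windows are only meaningful if positions from different videos share one coordinate frame.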
