TrajLoom: Dense Future Trajectory Generation from Video

Abstract

Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at https://trajloom.github.io/.
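The offset idea behind Grid-Anchor Offset Encoding can be illustrated with a minimal sketch. The abstract only states that each point is represented as an offset from its pixel-center anchor, so everything else here is an assumption made for illustration: the function names, the `+0.5` pixel-center convention, and the choice to use one fixed anchor per track. It is not the authors' implementation.

```python
# Hypothetical sketch: names and conventions are assumptions, not the paper's code.

def pixel_center_anchor(col, row):
    """Center of pixel (col, row), using the common +0.5 convention (assumed)."""
    return (col + 0.5, row + 0.5)

def encode_offsets(track, anchor):
    """Express each (x, y) position of a track as a displacement from its anchor."""
    ax, ay = anchor
    return [(x - ax, y - ay) for x, y in track]

def decode_offsets(offsets, anchor):
    """Invert encode_offsets by adding the anchor back to each displacement."""
    ax, ay = anchor
    return [(dx + ax, dy + ay) for dx, dy in offsets]

# Two tracks that share the same motion but start in different grid cells.
motion = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
anchor_a = pixel_center_anchor(3, 4)
anchor_b = pixel_center_anchor(100, 50)
track_a = [(anchor_a[0] + dx, anchor_a[1] + dy) for dx, dy in motion]
track_b = [(anchor_b[0] + dx, anchor_b[1] + dy) for dx, dy in motion]

offsets_a = encode_offsets(track_a, anchor_a)
offsets_b = encode_offsets(track_b, anchor_b)
```

In this toy setup the two tracks have very different absolute coordinates but identical offset sequences, which is one way to read the abstract's claim that the encoding reduces location-dependent bias. Decoding with the same anchor recovers the original positions.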
