TrajLoom: Dense Future Trajectory Generation from Video
March 23, 2026
Authors: Zewei Zhang, Jia Jun Cheng Xian, Kaiwen Liu, Ming Liang, Hang Chu, Jun Chen, Renjie Liao
cs.AI
Abstract
Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at https://trajloom.github.io/.
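The Grid-Anchor Offset Encoding described above can be illustrated with a minimal sketch. This is an illustrative interpretation based only on the abstract, not the authors' implementation: each tracked point is assumed to be expressed as a small offset from the center of the pixel cell it falls in, so the network sees roughly zero-centered values instead of absolute coordinates. The function names are hypothetical.

```python
import numpy as np

def grid_anchor_offset_encode(points):
    """Hypothetical sketch of Grid-Anchor Offset Encoding.

    points: (N, 2) array of (x, y) pixel coordinates.
    Each point is decomposed into a pixel-center anchor and a
    small residual offset, reducing location-dependent bias.
    """
    # Anchor = center of the pixel cell containing the point.
    anchors = np.floor(points) + 0.5
    # Offsets are in (-0.5, 0.5], roughly zero-centered.
    offsets = points - anchors
    return anchors, offsets

def grid_anchor_offset_decode(anchors, offsets):
    """Recover absolute pixel coordinates from anchor + offset."""
    return anchors + offsets
```

Because the offsets share a common small range regardless of where a point sits in the frame, a model predicting them does not need to learn position-dependent output statistics; decoding is the exact inverse, so the representation is lossless.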