Toward the Future, One Step at a Time

Abstract

Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.
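
To make the idea of step-wise rollout over sparse point trajectories concrete, the following is a minimal toy sketch, not the model described above. It assumes a hypothetical `step` function that adds plain Gaussian noise to each point's velocity as a stand-in for the learned diffusion sampler. All names, the two starting points, and the noise level are illustrative assumptions. It shows only the autoregressive structure (each step is conditioned on the previously sampled state) and how the spread across sampled futures grows with the horizon.

```python
import random
import statistics


def step(points, velocities, rng, noise=0.05):
    """One short transition: perturb each velocity, then move each point.

    ASSUMPTION: Gaussian noise replaces the learned diffusion sampler.
    """
    new_points, new_velocities = [], []
    for (x, y), (vx, vy) in zip(points, velocities):
        vx += rng.gauss(0.0, noise)
        vy += rng.gauss(0.0, noise)
        new_points.append((x + vx, y + vy))
        new_velocities.append((vx, vy))
    return new_points, new_velocities


def rollout(points, velocities, horizon, rng):
    """Autoregressive rollout: every step starts from the last sampled state."""
    trajectory = [points]
    for _ in range(horizon):
        points, velocities = step(points, velocities, rng)
        trajectory.append(points)
    return trajectory


def spread_per_step(futures, point_index=0):
    """Std. dev. of one point's x-coordinate across futures, per time step."""
    n_steps = len(futures[0])
    return [
        statistics.pstdev([future[t][point_index][0] for future in futures])
        for t in range(n_steps)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical sparse scene: two tracked points with initial motion hints.
start_points = [(0.0, 0.0), (1.0, 0.5)]
start_velocities = [(0.1, 0.0), (0.0, 0.1)]

# Sample many futures from the same starting state.
futures = [
    rollout(start_points, start_velocities, horizon=20, rng=rng)
    for _ in range(1000)
]

spread = spread_per_step(futures)
print(f"spread after 1 step:   {spread[1]:.3f}")
print(f"spread after 20 steps: {spread[-1]:.3f}")
```

Because each future consists of only a few point coordinates per step, sampling a thousand rollouts is cheap even in pure Python. The initial velocities play the role of an optional motion constraint in this toy setting.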