
Planning with Sketch-Guided Verification for Physics-Aware Video Generation

November 21, 2025
作者: Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal
cs.AI

Abstract

Recent video generation approaches increasingly rely on planning intermediate control signals, such as object trajectories, to improve temporal coherence and motion fidelity. However, these methods mostly employ either single-shot plans, which are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that introduces a test-time sampling and verification loop to produce more dynamically coherent motion plans (i.e., physically plausible and instruction-consistent trajectories) prior to full video generation. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To score candidate motion plans efficiently, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency over competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
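The sampling-and-verification loop described in the abstract can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: `sample_motion_plans`, `render_video_sketch`, and `verifier_score` are hypothetical stand-ins for the trajectory sampler, the lightweight sketch compositor, and the vision-language verifier, and the threshold/round parameters are assumed for illustration.

```python
import random

def sample_motion_plans(prompt, image, n):
    # Stand-in: propose n candidate object trajectories for the scene.
    return [f"plan_{i}" for i in range(n)]

def render_video_sketch(plan, background):
    # Stand-in: composite moving objects over the static background,
    # avoiding a full diffusion-based video synthesis per candidate.
    return f"sketch({plan}|{background})"

def verifier_score(sketch, prompt):
    # Stand-in: a vision-language verifier jointly scoring instruction
    # alignment and physical plausibility; here a random placeholder.
    return random.random()

def sketch_verify(prompt, image, n_candidates=4, threshold=0.5, max_rounds=3):
    """Sample candidate plans, rank them via cheap video sketches, and
    iterate until a plan clears the acceptance threshold (or rounds run out).
    The accepted plan would then condition the final video generator."""
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        for plan in sample_motion_plans(prompt, image, n_candidates):
            score = verifier_score(render_video_sketch(plan, image), prompt)
            if score > best_score:
                best_plan, best_score = plan, score
        if best_score >= threshold:
            break  # satisfactory plan found; stop refining
    return best_plan, best_score
```

The key efficiency point from the abstract is that the inner loop only renders sketches; the expensive trajectory-conditioned generator runs once, on the accepted plan.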
PDF · December 1, 2025