

Planning with Sketch-Guided Verification for Physics-Aware Video Generation

November 21, 2025
作者: Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal
cs.AI

Abstract

Recent video generation approaches increasingly rely on planning intermediate control signals, such as object trajectories, to improve temporal coherence and motion fidelity. However, these methods mostly rely either on single-shot plans, which are typically limited to simple motions, or on iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that introduces a test-time sampling and verification loop to produce more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) before full video generation. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To score candidate motion plans efficiently, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
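The abstract outlines a test-time sampling-and-verification loop but gives no implementation details. The Python pseudocode below is a minimal sketch of that loop as described; every helper (sample_motion_plans, render_video_sketch, vlm_verifier_score, generate_video) and every parameter value is a hypothetical placeholder, not the authors' actual interface.

```python
# Minimal sketch of the test-time sampling-and-verification loop described in
# the abstract. All helper functions and default values below are hypothetical
# placeholders, not the authors' real API.

def sketch_verify(prompt, reference_image, num_candidates=8,
                  score_threshold=0.8, max_rounds=3):
    """Plan object trajectories, verify them on cheap video sketches,
    and run the expensive video generator only on the selected plan."""
    best_plan, best_score = None, float("-inf")

    for _ in range(max_rounds):
        # 1) Propose multiple candidate motion plans (object trajectories).
        candidates = sample_motion_plans(prompt, reference_image, num_candidates)

        for plan in candidates:
            # 2) Render a lightweight "video sketch" by compositing the moving
            #    objects over the static background -- no diffusion synthesis.
            sketch = render_video_sketch(plan, reference_image)

            # 3) Score the sketch with a vision-language verifier that checks
            #    semantic alignment with the prompt and physical plausibility.
            score = vlm_verifier_score(sketch, prompt)

            if score > best_score:
                best_plan, best_score = plan, score

        # 4) Stop refining once a satisfactory plan is found.
        if best_score >= score_threshold:
            break

    # 5) Synthesize the final video with a trajectory-conditioned generator.
    return generate_video(prompt, reference_image, best_plan)
```

The design point the sketch illustrates is that the verifier only ever scores cheap composited sketches, so the trajectory-conditioned generator is invoked once, on the final selected plan.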