

Plan-X: Instruct Video Generation via Semantic Planning

November 22, 2025
Authors: Lun Huang, You Xie, Hongyi Xu, Tianpei Gu, Chenxu Zhang, Guoxian Song, Zenan Li, Xiaochen Zhao, Linjie Luo, Guillermo Sapiro
cs.AI

Abstract

Diffusion Transformers have demonstrated remarkable capabilities in visual synthesis, yet they often struggle with high-level semantic reasoning and long-horizon planning. This limitation frequently leads to visual hallucinations and misalignments with user instructions, especially in scenarios involving complex scene understanding, human-object interactions, multi-stage actions, and in-context motion reasoning. To address these challenges, we propose Plan-X, a framework that explicitly enforces high-level semantic planning to instruct the video generation process. At its core lies a Semantic Planner, a learnable multimodal language model that reasons over the user's intent from both text prompts and visual context, and autoregressively generates a sequence of text-grounded spatio-temporal semantic tokens. These semantic tokens, complementary to high-level text prompt guidance, serve as structured "semantic sketches" over time for the video diffusion model, which excels at synthesizing high-fidelity visual details. Plan-X effectively integrates the strength of language models in multimodal in-context reasoning and planning with the strength of diffusion models in photorealistic video synthesis. Extensive experiments demonstrate that our framework substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.
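
The abstract describes a two-stage pipeline: a semantic planner produces spatio-temporal semantic tokens from multimodal context, and a video diffusion model consumes those tokens as additional conditioning. The sketch below is only a minimal illustration of that interface, not the paper's implementation: the module names (SemanticPlanner, VideoDiffusionDenoiser), dimensions, token counts, and the concatenation-based conditioning scheme are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn


class SemanticPlanner(nn.Module):
    """Hypothetical stand-in for a multimodal semantic planner: a small
    transformer decoder that attends to text and visual context embeddings
    and emits a sequence of spatio-temporal semantic tokens (the "sketch")."""

    def __init__(self, dim=512, num_layers=4, num_heads=8, num_plan_tokens=16):
        super().__init__()
        self.num_plan_tokens = num_plan_tokens
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.query_embed = nn.Parameter(torch.randn(num_plan_tokens, dim))

    def forward(self, text_ctx, visual_ctx):
        # Concatenate multimodal context and let plan-token queries cross-attend to it.
        ctx = torch.cat([text_ctx, visual_ctx], dim=1)                    # (B, Lt+Lv, D)
        queries = self.query_embed.unsqueeze(0).expand(ctx.size(0), -1, -1)
        # Causal mask: each plan token only sees earlier ones (autoregressive ordering).
        mask = nn.Transformer.generate_square_subsequent_mask(self.num_plan_tokens)
        return self.decoder(queries, ctx, tgt_mask=mask)                  # (B, T_plan, D)


class VideoDiffusionDenoiser(nn.Module):
    """Toy denoiser standing in for the video diffusion model. It conditions
    on both the text prompt embeddings and the planner's semantic tokens."""

    def __init__(self, dim=512, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, noisy_latents, t, text_ctx, plan_tokens):
        # Conditioning = text embeddings + semantic plan tokens; timestep added to latents.
        cond = torch.cat([text_ctx, plan_tokens], dim=1)
        x = noisy_latents + self.time_mlp(t.view(-1, 1, 1).float())
        return self.blocks(x, cond)                                       # predicted noise


if __name__ == "__main__":
    B, D = 2, 512
    text_ctx = torch.randn(B, 20, D)    # placeholder for frozen text-encoder outputs
    visual_ctx = torch.randn(B, 64, D)  # placeholder for visual-context features
    latents = torch.randn(B, 128, D)    # placeholder for flattened video latents
    t = torch.randint(0, 1000, (B,))

    planner = SemanticPlanner(dim=D)
    denoiser = VideoDiffusionDenoiser(dim=D)

    plan = planner(text_ctx, visual_ctx)            # high-level semantic "sketch" tokens
    eps_hat = denoiser(latents, t, text_ctx, plan)  # plan-conditioned denoising step
    print(plan.shape, eps_hat.shape)
```

The point of the sketch is the division of labor the abstract emphasizes: the planner handles multimodal in-context reasoning and emits compact, text-grounded tokens, while the diffusion model treats those tokens as extra conditioning alongside the prompt when synthesizing visual detail.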