Code2Video: A Code-centric Paradigm for Educational Video Generation
October 1, 2025
Authors: Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou
cs.AI
Abstract
While recent generative models have advanced pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions; this limits their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares the corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages a vision-language model (VLM) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and, in particular, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach, achieving a 40% improvement over direct code generation and producing videos comparable to human-crafted tutorials. The code and datasets are available at https://github.com/showlab/Code2Video.
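
To make the pipeline concrete, below is a minimal, illustrative Python sketch of the Planner → Coder → Critic loop described in the abstract. The function names, prompt wording, the llm/vlm/render interfaces, and the choice of Manim as the rendering library are assumptions made only for illustration; the authors' actual implementation is in the repository linked above.

```python
# Illustrative sketch only: agent names follow the abstract, but prompts,
# interfaces, and the use of Manim are assumptions, not the paper's code.
from typing import Callable

LLMCall = Callable[[str], str]             # assumed text-model interface
VLMCall = Callable[[str, list[str]], str]  # assumed vision-language-model interface


def generate_video(topic: str, llm: LLMCall, vlm: VLMCall, render: Callable,
                   max_fix_rounds: int = 3) -> list[list[str]]:
    """Return rendered frame paths, one list per planned lecture section."""
    # (i) Planner: structure the lecture into temporally coherent sections,
    # each with a one-line description of its visual assets and layout.
    plan = llm(f"Split the topic '{topic}' into ordered lecture sections; "
               "describe each section's visual assets and layout in one line.")
    sections = [s for s in plan.splitlines() if s.strip()]

    clips: list[list[str]] = []
    for section in sections:
        feedback = ""
        for _ in range(max_fix_rounds):
            # (ii) Coder: turn the structured instruction into executable
            # Python (here, a hypothetical Manim scene); feedback from failed
            # runs or from the Critic drives the auto-fix retries.
            code = llm(f"Write a runnable Manim scene for: {section}\n"
                       f"Apply this feedback if present: {feedback}")
            ok, error, frames = render(code)  # execute the generated code
            if not ok:
                feedback = f"Fix this runtime error: {error}"
                continue
            # (iii) Critic: a VLM inspects key frames for spatial layout and
            # clarity problems (overlap, off-screen elements, illegible text).
            feedback = vlm("List layout or legibility problems in these frames; "
                           "reply 'OK' if there are none.", frames)
            if feedback.strip() == "OK":
                clips.append(frames)
                break
    return clips
```

Because every visual element is produced by executing code rather than by pixel-space generation, each intermediate step stays inspectable and editable, which is what the abstract refers to when calling the approach interpretable and controllable.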