Code2Video: A Code-centric Paradigm for Educational Video Generation
October 1, 2025
Authors: Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou
cs.AI
Abstract
While recent generative models have advanced pixel-space video synthesis, they
remain limited in producing professional educational videos, which demand
disciplinary knowledge, precise visual structures, and coherent transitions;
this restricts their applicability in educational scenarios. Intuitively, such
requirements are better addressed through the manipulation of a renderable
environment, which can be explicitly controlled via logical commands (e.g.,
code). In this work, we propose Code2Video, a code-centric agent framework for
generating educational videos via executable Python code. The framework
comprises three collaborative agents: (i) Planner, which structures lecture
content into temporally coherent flows and prepares corresponding visual
assets; (ii) Coder, which converts structured instructions into executable
Python code while incorporating a scope-guided auto-fix mechanism to enhance efficiency;
and (iii) Critic, which leverages vision-language models (VLMs) with visual
anchor prompts to refine spatial layout and ensure clarity. To support
systematic evaluation, we build MMMC, a benchmark of professionally produced,
discipline-specific educational videos. We evaluate MMMC across diverse
dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and,
in particular, TeachQuiz, a novel end-to-end metric that quantifies how well a
VLM, after unlearning, can recover knowledge by watching the generated videos.
Our results demonstrate the potential of Code2Video as a scalable,
interpretable, and controllable approach, achieving a 40% efficiency improvement over direct
code generation and producing videos comparable to human-crafted tutorials. The
code and datasets are available at https://github.com/showlab/Code2Video.
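The three-agent collaboration can be pictured as a generate-execute-review loop. The Python sketch below is purely illustrative and is not the authors' implementation: the function names (planner, coder, scope_guided_autofix, critic, execute) and the use of a subprocess to run the generated script are assumptions for exposition; in the actual system each agent queries an LLM or VLM and the Coder emits animation-rendering code.

# Hypothetical sketch of the Planner -> Coder -> Critic loop described above.
# All agent internals are stubbed; the real system prompts LLMs/VLMs at each step.
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    narration: str
    assets: list = field(default_factory=list)  # visual assets prepared by the Planner


def planner(topic: str) -> list[Section]:
    # Planner: organize lecture content into a temporally coherent flow (stub).
    return [Section(title=f"{topic}: introduction", narration="...")]


def coder(section: Section) -> str:
    # Coder: convert a structured section into executable Python (stub emits a trivial script).
    return f'print("Rendering section: {section.title}")'


def scope_guided_autofix(code: str, error: str) -> str:
    # Auto-fix: a real implementation would localize and patch only the failing scope.
    return code


def execute(code: str) -> tuple[bool, str]:
    # Run the generated code in a subprocess; report success and any stderr.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr


def critic(rendered_ok: bool) -> bool:
    # Critic: a VLM with visual anchor prompts would refine spatial layout (stub accepts success).
    return rendered_ok


def code2video(topic: str, max_fix_rounds: int = 3) -> None:
    for section in planner(topic):
        code = coder(section)
        for _ in range(max_fix_rounds):
            ok, err = execute(code)
            if ok and critic(ok):
                break
            code = scope_guided_autofix(code, err)


if __name__ == "__main__":
    code2video("Binary search")

The key design point the abstract emphasizes is that correctness is enforced in code space: failed executions are repaired within a bounded number of scope-guided fix rounds, and the Critic's layout feedback closes the loop before a section is accepted.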