Code2Video: A Code-centric Paradigm for Educational Video Generation

Abstract

Recent generative models have advanced pixel-space video synthesis, but they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions. Intuitively, such requirements are better addressed by manipulating a renderable environment that can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares the corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code and incorporates scope-guided auto-fix to improve efficiency; and (iii) Critic, which leverages vision-language models (VLMs) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate on MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and, in particular, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach: it achieves a 40% improvement over direct code generation and produces videos comparable to human-crafted tutorials. The code and datasets are available at https://github.com/showlab/Code2Video.
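To make the Planner, Coder, and Critic roles concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not the Code2Video implementation: every name here (`Section`, `plan`, `parses`, `write_code`, `review`, `teachquiz_gain`) is hypothetical, the model calls are replaced by toy functions, and "scope-guided auto-fix" is approximated as repairing only the section whose code fails to parse. The `teachquiz_gain` helper reflects one assumed reading of TeachQuiz, namely the quiz accuracy a model regains after watching the video relative to its accuracy after unlearning; the abstract does not give the exact formula.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Section:
    """One step of the lecture outline (hypothetical structure)."""
    title: str
    code: str = ""


def plan(topic: str, points: List[str]) -> List[Section]:
    """Planner stand-in: arrange lecture points into an ordered outline."""
    return [Section(title=f"{topic}: {point}") for point in points]


def parses(code: str) -> bool:
    """Cheap executability proxy: does the snippet parse as Python?"""
    try:
        compile(code, "<section>", "exec")
        return True
    except SyntaxError:
        return False


def write_code(
    sections: List[Section],
    generate: Callable[[str], str],
    fix: Callable[[str], str],
    max_attempts: int = 3,
) -> List[Section]:
    """Coder stand-in: generate per-section code; repair only sections that fail."""
    for section in sections:
        section.code = generate(section.title)
        for _ in range(max_attempts):
            if parses(section.code):
                break
            section.code = fix(section.code)  # the fix is scoped to this section
    return sections


def review(sections: List[Section], critique: Callable[[str], str]) -> List[Section]:
    """Critic stand-in: let a reviewer (a VLM in the paper) rewrite each section."""
    for section in sections:
        section.code = critique(section.code)
    return sections


def teachquiz_gain(acc_unlearned: float, acc_after_video: float) -> float:
    """Assumed reading of TeachQuiz: quiz accuracy recovered by watching the video."""
    return acc_after_video - acc_unlearned


# Toy stand-ins for the model calls (assumptions; no real LLM or VLM is involved).
def toy_generate(title: str) -> str:
    # Emit a deliberately broken snippet for one section to exercise the repair loop.
    return "print('Intro'" if title.endswith("intro") else f"print({title!r})"


def toy_fix(code: str) -> str:
    return code + ")"


outline = plan("Fourier series", ["intro", "worked example"])
outline = write_code(outline, toy_generate, toy_fix)
outline = review(outline, lambda code: code.rstrip() + "\n")
```

Repairing at section granularity means a single failing step does not force the whole script to be regenerated, which is one plausible way such a design could reduce the cost of code generation.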
