TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding

February 26, 2025
Authors: Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen
cs.AI

Abstract

Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.
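The abstract describes an agent that emits Manim animation code to build these explanation videos, but no code is shown here. Purely as an illustrative sketch of the kind of scene such an agent might generate, the snippet below uses the Manim Community API; the choice of theorem, the scene name, and the layout are placeholder assumptions rather than output from TheoremExplainAgent itself.

```python
from manim import Scene, Text, MathTex, Write, FadeIn, DOWN


class PythagoreanTheoremScene(Scene):
    """Minimal scene: display a theorem's name and its statement."""

    def construct(self):
        # Theorem name rendered as plain text
        title = Text("Pythagorean Theorem")

        # The statement rendered with LaTeX, positioned below the title
        statement = MathTex(r"a^2 + b^2 = c^2")
        statement.next_to(title, DOWN)

        # Animate: write the title, then fade in the statement
        self.play(Write(title))
        self.play(FadeIn(statement))
        self.wait(2)
```

A scene like this would typically be rendered with the Manim CLI, e.g. `manim -pql scene.py PythagoreanTheoremScene`; a long-form video of the kind the paper targets would chain many such scenes with narration.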
