TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding

Abstract

Understanding domain-specific theorems often requires more than text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) perform strongly on text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results show that agentic planning is essential for generating detailed long-form videos, and that the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the generated videos exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, which underscores their importance.