

BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion

December 2, 2025
作者: Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti, Enes Yavuz Ugan, Alexander Waibel, Jan Niehues
cs.AI

Abstract

The globalization of education and the rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our slide-translation code at https://github.com/saikoneru/image-translator and integrate it into Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.
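The three-modality output the abstract describes can be pictured as a cascaded pipeline: slide-aware transcription, translation, slide localization, and speech synthesis. The sketch below is purely illustrative; every function name and data type is a hypothetical stand-in for the ASR, MT, slide-translation, and TTS components, not the actual BOOM implementation.

```python
from dataclasses import dataclass


@dataclass
class LectureOutput:
    """The three synchronized output modalities."""
    transcript: str    # translated text for reading
    slides: list[str]  # localized slide text, one entry per slide
    speech: bytes      # synthesized audio in the target language


# Illustrative stubs -- a real system would call ASR, MT, OCR/rendering,
# and TTS models here.
def transcribe_audio(audio: bytes, slide_context: list[str]) -> str:
    # Slide-aware ASR: slide text can bias recognition of domain terms,
    # which is what yields the downstream benefits noted in the abstract.
    return "transcript conditioned on: " + "; ".join(slide_context)


def translate_text(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def localize_slides(slides: list[str], target_lang: str) -> list[str]:
    # Real slide translation must also preserve layout and visual elements.
    return [f"[{target_lang}] {s}" for s in slides]


def synthesize_speech(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for TTS audio


def translate_lecture(audio: bytes, slides: list[str],
                      target_lang: str) -> LectureOutput:
    """Jointly process audio and slides into three output modalities."""
    transcript = transcribe_audio(audio, slide_context=slides)
    translated = translate_text(transcript, target_lang)
    return LectureOutput(
        transcript=translated,
        slides=localize_slides(slides, target_lang),
        speech=synthesize_speech(translated),
    )
```

The key design point sketched here is that the slide content feeds into transcription before translation, so both the text and speech outputs benefit from the visual modality.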
December 4, 2025