

BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion

December 2, 2025
Authors: Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti, Enes Yavuz Ugan, Alexander Waibel, Jan Niehues
cs.AI

Abstract

The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.
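To make the described workflow concrete, the sketch below outlines a cascaded lecture-translation pipeline of the kind the abstract describes: per time-aligned segment, a slide-aware transcription step feeds text translation, slide localization, and speech synthesis so the three output modalities stay synchronized. This is a minimal illustration only; all function names, data structures, and file paths here are hypothetical placeholders and do not reflect the actual BOOM or Lecture Translator APIs.

```python
# Illustrative sketch of a cascaded, slide-aware lecture-translation pipeline.
# All names below (LectureSegment, transcribe, localize_slide, ...) are
# hypothetical stubs, not the released BOOM / Lecture Translator code.

from dataclasses import dataclass
from typing import List


@dataclass
class LectureSegment:
    """One time-aligned chunk of a lecture: audio plus the slide shown at that time."""
    audio_path: str
    slide_image_path: str
    start_s: float
    end_s: float


@dataclass
class TranslatedSegment:
    transcript: str          # source-language transcript (from ASR)
    translation: str         # target-language text
    localized_slide: str     # path to the re-rendered, translated slide
    synthesized_audio: str   # path to the synthesized target-language speech


def transcribe(audio_path: str, slide_image_path: str) -> str:
    """Hypothetical slide-aware ASR step: slide content could bias recognition
    of domain terms. Returns a placeholder transcript here."""
    return f"<transcript of {audio_path} conditioned on {slide_image_path}>"


def translate_text(text: str, target_lang: str) -> str:
    """Hypothetical machine-translation step."""
    return f"<{target_lang} translation of: {text}>"


def localize_slide(slide_image_path: str, target_lang: str) -> str:
    """Hypothetical slide-localization step: detect text regions, translate
    them, and re-render onto the original layout so visuals are preserved."""
    return f"{slide_image_path}.{target_lang}.png"


def synthesize_speech(text: str, target_lang: str) -> str:
    """Hypothetical text-to-speech step."""
    return f"tts_{target_lang}.wav"


def translate_lecture(segments: List[LectureSegment], target_lang: str) -> List[TranslatedSegment]:
    """Run the cascade per segment so text, slides, and speech stay in sync."""
    outputs = []
    for seg in segments:
        transcript = transcribe(seg.audio_path, seg.slide_image_path)
        translation = translate_text(transcript, target_lang)
        outputs.append(TranslatedSegment(
            transcript=transcript,
            translation=translation,
            localized_slide=localize_slide(seg.slide_image_path, target_lang),
            synthesized_audio=synthesize_speech(translation, target_lang),
        ))
    return outputs


if __name__ == "__main__":
    demo = [LectureSegment("lecture_000.wav", "slide_01.png", 0.0, 30.0)]
    for out in translate_lecture(demo, target_lang="de"):
        print(out)
```

Keeping the cascade segment-level, as sketched, is one way the translated text, localized slide, and synthesized speech can remain time-aligned with the original lecture; the released repositories linked above contain the actual implementation.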