Lumos: Empowering Multimodal LLMs with Scene Text Recognition
February 12, 2024
Authors: Ashish Shenoy, Yichao Lu, Srihari Jayakumar, Debojeet Chatterjee, Mohsen Moslehpour, Pierce Chuang, Abhay Harpale, Vikas Bhardwaj, Di Xu, Shicong Zhao, Longfang Zhao, Ankit Ramchandani, Xin Luna Dong, Anuj Kumar
cs.AI
Abstract
We introduce Lumos, the first end-to-end multimodal question-answering system
with text understanding capabilities. At the core of Lumos is a Scene Text
Recognition (STR) component that extracts text from first person point-of-view
images, the output of which is used to augment input to a Multimodal Large
Language Model (MM-LLM). While building Lumos, we encountered numerous
challenges related to STR quality, overall latency, and model inference. In
this paper, we delve into those challenges, and discuss the system
architecture, design choices, and modeling techniques employed to overcome
these obstacles. We also provide a comprehensive evaluation for each component,
showcasing high quality and efficiency.