Lumos: Empowering Multimodal LLMs with Scene Text Recognition
February 12, 2024
Authors: Ashish Shenoy, Yichao Lu, Srihari Jayakumar, Debojeet Chatterjee, Mohsen Moslehpour, Pierce Chuang, Abhay Harpale, Vikas Bhardwaj, Di Xu, Shicong Zhao, Longfang Zhao, Ankit Ramchandani, Xin Luna Dong, Anuj Kumar
cs.AI
Abstract
We introduce Lumos, the first end-to-end multimodal question-answering system
with text understanding capabilities. At the core of Lumos is a Scene Text
Recognition (STR) component that extracts text from first person point-of-view
images, the output of which is used to augment input to a Multimodal Large
Language Model (MM-LLM). While building Lumos, we encountered numerous
challenges related to STR quality, overall latency, and model inference. In
this paper, we delve into those challenges, and discuss the system
architecture, design choices, and modeling techniques employed to overcome
these obstacles. We also provide a comprehensive evaluation for each component,
showcasing high quality and efficiency.