Lumos: Empowering Multimodal LLMs with Scene Text Recognition

Abstract

We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first-person point-of-view images; its output is used to augment the input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we examine those challenges and discuss the system architecture, design choices, and modeling techniques employed to overcome them. We also provide a comprehensive evaluation of each component, showcasing high quality and efficiency.
