Lumos: Empowering Multimodal LLMs with Scene Text Recognition

Abstract

We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first-person point-of-view images; its output is used to augment the input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we examine those challenges in depth and discuss the system architecture, design choices, and modeling techniques we employed to overcome them. We also provide a comprehensive evaluation of each component, demonstrating high quality and efficiency.
