Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, drawing broad interest from researchers and practitioners alike. However, their long-context capabilities have not yet been comprehensively evaluated. To address this gap, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and we develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle stress-tests an MLLM's ability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup requires an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems on negative samples, i.e., when the needle is not in the haystack. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.
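
The abstract mentions image stitching and automatic label generation without giving the details, so the following Python sketch is only one plausible reading, not the benchmark's actual implementation. It assumes each haystack image is an n-by-n grid of equally sized sub-images and that a label is an (image index, row, column) triple, with None for a negative sample. The names stitch_grid and build_sample, the parameters m and n, and the nested-list image representation are all hypothetical and chosen for illustration.

```python
import random


def stitch_grid(sub_images, n):
    """Stitch n*n equally sized sub-images (2D lists of pixels) into one image.

    Assumption: sub-images are laid out row by row in an n-by-n grid.
    """
    if len(sub_images) != n * n:
        raise ValueError("expected exactly n*n sub-images")
    height = len(sub_images[0])
    stitched = []
    for grid_row in range(n):
        tiles = sub_images[grid_row * n:(grid_row + 1) * n]
        for y in range(height):
            row_pixels = []
            for tile in tiles:
                row_pixels.extend(tile[y])  # append this tile's y-th pixel row
            stitched.append(row_pixels)
    return stitched


def build_sample(pool, needle, m, n, positive, rng):
    """Build a haystack of m stitched images plus its retrieval label.

    Hypothetical label format: (image_index, row, col) for a positive
    sample, None for a negative sample (needle absent from the haystack).
    """
    per_image = n * n
    total = m * per_image
    distractors = [img for img in pool if img is not needle]
    cells = rng.sample(distractors, total)  # needs at least `total` distractors
    label = None
    if positive:
        position = rng.randrange(total)
        cells[position] = needle  # overwrite one distractor with the needle
        image_index, within = divmod(position, per_image)
        row, col = divmod(within, n)
        label = (image_index, row, col)
    haystack = [
        stitch_grid(cells[i * per_image:(i + 1) * per_image], n)
        for i in range(m)
    ]
    return haystack, label
```

Because the label is derived from the position at which the needle was placed, no manual annotation is needed. In this sketch, a negative sample simply skips the placement step, which is the case the abstract singles out for hallucination.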
