Introducing Visual Perception Token into Multimodal Large Language Model

Abstract

To use visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the ability to control their own visual perception processes autonomously, for example by selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, which gives an MLLM a mechanism to control its visual perception processes. We design two types of Visual Perception Tokens: the Region Selection Token and the Vision Re-Encoding Token. MLLMs generate these tokens autonomously, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in spatial reasoning, fine-grained understanding, and other tasks. On average, introducing Visual Perception Tokens improves the performance of a 2B model by 23.6%, raising its score from 0.572 to 0.708, and the resulting model even outperforms a 7B-parameter model (which scores 0.624) by 13.4%. Code is available at https://github.com/yu-rp/VisualPerceptionToken
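
The region-selection idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the textual token form `<region x0 y0 x1 y1>`, the 4x4 grid, and the helper names `parse_region`, `crop`, and `perceive_again` are all hypothetical, chosen only to show how a token emitted during generation might trigger a second look at part of an image. The image is modeled as a plain nested list of pixel values to keep the example dependency-free.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical textual form of a region-selection token. The abstract does not
# specify the real token format, so this pattern is an assumption.
REGION_PATTERN = re.compile(r"<region (\d+) (\d+) (\d+) (\d+)>")
GRID = 4  # assumed: the image is divided into GRID x GRID cells

Image = List[List[int]]
Cells = Tuple[int, int, int, int]


def parse_region(text: str) -> Optional[Cells]:
    """Return (x0, y0, x1, y1) cell indices if a valid token is present."""
    match = REGION_PATTERN.search(text)
    if match is None:
        return None
    x0, y0, x1, y1 = (int(g) for g in match.groups())
    # Reject boxes that are inverted or fall outside the grid.
    if not (0 <= x0 <= x1 < GRID and 0 <= y0 <= y1 < GRID):
        return None
    return x0, y0, x1, y1


def crop(image: Image, cells: Cells) -> Image:
    """Crop the image to the box spanned by inclusive grid-cell indices."""
    x0, y0, x1, y1 = cells
    height, width = len(image), len(image[0])
    top, bottom = y0 * height // GRID, (y1 + 1) * height // GRID
    left, right = x0 * width // GRID, (x1 + 1) * width // GRID
    return [row[left:right] for row in image[top:bottom]]


def perceive_again(image: Image, generated_text: str) -> Image:
    """Return the crop selected by the token, or the image itself if none."""
    cells = parse_region(generated_text)
    if cells is None:
        return image
    return crop(image, cells)
```

In a full system the cropped region would presumably be passed back through the vision encoder and its features appended to the model's context; that step is omitted here because the abstract does not describe it in detail.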
