OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Abstract

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities, but they lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework that combines powerful pixel-level vision understanding with reasoning abilities. It accepts various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and for producing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect separate specialist models, our work aims at end-to-end training of one encoder, one decoder, and one LLM. The code and model have been released for further research.
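
The abstract does not spell out how perception priors are merged with image features, so the following is only a rough sketch under stated assumptions, not the released implementation. It assumes the segmentation encoder yields per-pixel features, a set of object query vectors, and per-pixel mask logits over those queries. Each pixel feature is then augmented with a softmax-weighted average of the queries, and the resulting pixel tokens are concatenated with the object queries and any visual-prompt queries to form the token sequence for the LLM. All function names, argument names, and the softmax weighting are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed_perception_prior(image_feats, object_queries, mask_logits):
    """Add a mask-weighted mix of object queries to each pixel feature.

    Assumed shapes (hypothetical, for illustration only):
      image_feats:    P pixel vectors of dimension D
      object_queries: Q query vectors of dimension D
      mask_logits:    P rows of Q scores (how strongly pixel p belongs to query q)
    """
    pixel_tokens = []
    for feat, logits in zip(image_feats, mask_logits):
        weights = softmax(logits)  # per-pixel distribution over object queries
        prior = [
            sum(w * query[d] for w, query in zip(weights, object_queries))
            for d in range(len(feat))
        ]
        pixel_tokens.append([f + p for f, p in zip(feat, prior)])
    return pixel_tokens


def build_visual_tokens(image_feats, object_queries, mask_logits, prompt_queries=()):
    """Concatenate pixel tokens, object queries, and visual-prompt queries."""
    pixel_tokens = embed_perception_prior(image_feats, object_queries, mask_logits)
    object_tokens = [list(q) for q in object_queries]
    prompt_tokens = [list(q) for q in prompt_queries]
    return pixel_tokens + object_tokens + prompt_tokens


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]          # two pixels, D = 2
    queries = [[10.0, 0.0], [0.0, 10.0]]      # two object queries
    logits = [[0.0, 0.0], [100.0, -100.0]]    # pixel 1 clearly belongs to query 0
    print(len(build_visual_tokens(feats, queries, logits)))
```

In a real system these would be batched tensor operations followed by a learned projection into the LLM embedding space; the list-based version here only makes the data flow explicit.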
