OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
June 27, 2024
Authors: Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
cs.AI
Abstract
Current universal segmentation methods demonstrate strong capabilities in
pixel-level image and video understanding. However, they lack reasoning
abilities and cannot be controlled via text instructions. In contrast, large
vision-language multimodal models exhibit powerful vision-based conversation
and reasoning capabilities but lack pixel-level understanding and have
difficulty accepting visual prompts for flexible user interaction. This paper
proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level
vision understanding with reasoning abilities. It can accept various visual and
text prompts for flexible user interaction. Specifically, we use a universal
segmentation method as the visual encoder, integrating image information,
perception priors, and visual prompts into visual tokens provided to the LLM.
The LLM is responsible for understanding the user's text instructions and
providing text responses and pixel-level segmentation results based on the
visual information. We propose perception prior embedding to better integrate
perception priors with image features. OMG-LLaVA achieves image-level,
object-level, and pixel-level reasoning and understanding in a single model,
matching or surpassing the performance of specialized methods on multiple
benchmarks. Rather than using an LLM to connect each specialist, our work aims
at end-to-end training of one encoder, one decoder, and one LLM. The code and
model have been released for further research.
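
The abstract describes a single pipeline in which a frozen universal segmentation model supplies pixel-centric and object-centric visual tokens, a perception prior embedding fuses the segmentation priors with image features, and the LLM emits both a text response and a segmentation query that a mask decoder turns into pixel-level output. The sketch below illustrates that data flow in PyTorch; all module names, shapes, and the specific fusion rule (pooling object queries back onto pixels) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the pipeline described in the abstract: a frozen universal
# segmentation encoder supplies image features and object queries, a perception
# prior embedding fuses them, and an LLM produces text plus a segmentation query.
# Every module, dimension, and fusion rule here is an illustrative assumption.
import torch
import torch.nn as nn


class PerceptionPriorEmbedding(nn.Module):
    """Inject object-level priors from the segmentation decoder into pixel features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_feats, mask_queries, mask_scores):
        # image_feats: (B, N, C) pixel tokens, mask_queries: (B, Q, C),
        # mask_scores: (B, Q, N) soft pixel-to-query assignments.
        # Scatter each query back onto the pixels it covers and add it as a prior.
        prior = torch.einsum("bqn,bqc->bnc", mask_scores.softmax(dim=1), mask_queries)
        return image_feats + self.proj(prior)


class OMGLLaVASketch(nn.Module):
    """One (frozen) segmentation encoder/decoder + one projector + one LLM stand-in."""

    def __init__(self, dim: int = 256, llm_dim: int = 512, vocab: int = 1000):
        super().__init__()
        self.prior_embed = PerceptionPriorEmbedding(dim)
        self.to_llm = nn.Linear(dim, llm_dim)  # visual projector into the LLM space
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM
        self.text_head = nn.Linear(llm_dim, vocab)  # text response logits
        self.seg_head = nn.Linear(llm_dim, dim)     # maps a [SEG]-style token to a mask query

    def forward(self, image_feats, mask_queries, mask_scores, text_embeds):
        # Pixel-centric tokens enriched with perception priors, plus
        # object-centric tokens taken directly from the segmentation queries.
        pixel_tokens = self.prior_embed(image_feats, mask_queries, mask_scores)
        vis_tokens = self.to_llm(torch.cat([pixel_tokens, mask_queries], dim=1))

        # The LLM reads [visual tokens; instruction tokens] and answers in text.
        hidden = self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
        text_logits = self.text_head(hidden)

        # Treat the final hidden state as a [SEG]-style token; a frozen mask
        # decoder would correlate its query with pixel features to get a mask.
        seg_query = self.seg_head(hidden[:, -1])                  # (B, C)
        mask_logits = torch.einsum("bc,bnc->bn", seg_query, image_feats)
        return text_logits, mask_logits


if __name__ == "__main__":
    B, N, Q, C, D = 1, 196, 16, 256, 512
    model = OMGLLaVASketch(dim=C, llm_dim=D)
    text_logits, mask_logits = model(
        torch.randn(B, N, C),    # pixel features from the frozen encoder
        torch.randn(B, Q, C),    # object queries from the segmentation decoder
        torch.randn(B, Q, N),    # per-query mask logits over pixels
        torch.randn(B, 8, D),    # embedded text instruction (8 tokens)
    )
    print(text_logits.shape, mask_logits.shape)  # (1, 220, 1000) (1, 196)
```

Under this layout, keeping the segmentation model as both the visual encoder and the mask decoder is what lets a single model cover image-level, object-level, and pixel-level tasks; only the projector, the LLM, and the lightweight heads would need instruction tuning.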