OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
June 27, 2024
Authors: Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
cs.AI
Abstract
Current universal segmentation methods demonstrate strong capabilities in
pixel-level image and video understanding. However, they lack reasoning
abilities and cannot be controlled via text instructions. In contrast, large
vision-language multimodal models exhibit powerful vision-based conversation
and reasoning capabilities but lack pixel-level understanding and have
difficulty accepting visual prompts for flexible user interaction. This paper
proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level
vision understanding with reasoning abilities. It can accept various visual and
text prompts for flexible user interaction. Specifically, we use a universal
segmentation method as the visual encoder, integrating image information,
perception priors, and visual prompts into visual tokens provided to the LLM.
The LLM is responsible for understanding the user's text instructions and
providing text responses and pixel-level segmentation results based on the
visual information. We propose perception prior embedding to better integrate
perception priors with image features. OMG-LLaVA achieves image-level,
object-level, and pixel-level reasoning and understanding in a single model,
matching or surpassing the performance of specialized methods on multiple
benchmarks. Rather than using an LLM to connect each specialist, our work aims
at end-to-end training of one encoder, one decoder, and one LLM. The code and
model have been released for further research.
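
The abstract describes a single pipeline in which a frozen universal segmentation model supplies pixel-centric and object-centric visual tokens, a perception prior embedding fuses the segmentation priors with image features, and the LLM emits both a text response and a segmentation query that a mask decoder turns into pixel-level output. The sketch below illustrates that data flow in PyTorch; all module names, shapes, and the specific fusion rule (pooling object queries back onto pixels) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the pipeline described in the abstract: a frozen universal
# segmentation encoder supplies image features and object queries, a perception
# prior embedding fuses them, and an LLM produces text plus a segmentation query.
# Every module, dimension, and fusion rule here is an illustrative assumption.
import torch
import torch.nn as nn


class PerceptionPriorEmbedding(nn.Module):
    """Inject object-level priors from the segmentation decoder into pixel features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_feats, mask_queries, mask_scores):
        # image_feats: (B, N, C) pixel tokens, mask_queries: (B, Q, C),
        # mask_scores: (B, Q, N) soft pixel-to-query assignments.
        # Scatter each query back onto the pixels it covers and add it as a prior.
        prior = torch.einsum("bqn,bqc->bnc", mask_scores.softmax(dim=1), mask_queries)
        return image_feats + self.proj(prior)


class OMGLLaVASketch(nn.Module):
    """One (frozen) segmentation encoder/decoder + one projector + one LLM stand-in."""

    def __init__(self, dim: int = 256, llm_dim: int = 512, vocab: int = 1000):
        super().__init__()
        self.prior_embed = PerceptionPriorEmbedding(dim)
        self.to_llm = nn.Linear(dim, llm_dim)  # visual projector into the LLM space
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM
        self.text_head = nn.Linear(llm_dim, vocab)  # text response logits
        self.seg_head = nn.Linear(llm_dim, dim)     # maps a [SEG]-style token to a mask query

    def forward(self, image_feats, mask_queries, mask_scores, text_embeds):
        # Pixel-centric tokens enriched with perception priors, plus
        # object-centric tokens taken directly from the segmentation queries.
        pixel_tokens = self.prior_embed(image_feats, mask_queries, mask_scores)
        vis_tokens = self.to_llm(torch.cat([pixel_tokens, mask_queries], dim=1))

        # The LLM reads [visual tokens; instruction tokens] and answers in text.
        hidden = self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
        text_logits = self.text_head(hidden)

        # Treat the final hidden state as a [SEG]-style token; a frozen mask
        # decoder would correlate its query with pixel features to get a mask.
        seg_query = self.seg_head(hidden[:, -1])                  # (B, C)
        mask_logits = torch.einsum("bc,bnc->bn", seg_query, image_feats)
        return text_logits, mask_logits


if __name__ == "__main__":
    B, N, Q, C, D = 1, 196, 16, 256, 512
    model = OMGLLaVASketch(dim=C, llm_dim=D)
    text_logits, mask_logits = model(
        torch.randn(B, N, C),    # pixel features from the frozen encoder
        torch.randn(B, Q, C),    # object queries from the segmentation decoder
        torch.randn(B, Q, N),    # per-query mask logits over pixels
        torch.randn(B, 8, D),    # embedded text instruction (8 tokens)
    )
    print(text_logits.shape, mask_logits.shape)  # (1, 220, 1000) (1, 196)
```

Under this layout, keeping the segmentation model as both the visual encoder and the mask decoder is what lets a single model cover image-level, object-level, and pixel-level tasks; only the projector, the LLM, and the lightweight heads would need instruction tuning.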