OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

June 27, 2024
Authors: Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
cs.AI

Abstract

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect each specialist, our work aims at end-to-end training of one encoder, one decoder, and one LLM. The code and model have been released for further research.
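One way to picture the perception prior embedding described in the abstract is as a mask-weighted fusion of object-query embeddings back into the per-pixel image features before they are passed to the LLM as visual tokens. The sketch below is only an illustration of that idea under assumed tensor shapes; the class name `PerceptionPriorEmbedding`, the weighting scheme, and the toy dimensions are assumptions for exposition, not the released OMG-LLaVA code.

```python
import torch
import torch.nn as nn


class PerceptionPriorEmbedding(nn.Module):
    """Illustrative sketch: fuse per-pixel image features with object-level
    'perception priors' (soft masks and query embeddings) produced by a
    universal segmentation decoder, yielding visual tokens for the LLM.
    Names and shapes are assumptions, not the released OMG-LLaVA API."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, pixel_feats, mask_scores, query_feats):
        # pixel_feats: (B, N, C) patch/pixel features from the frozen encoder
        # mask_scores: (B, Q, N) soft masks predicted by the segmentation decoder
        # query_feats: (B, Q, C) object-query embeddings from the same decoder
        weights = mask_scores.softmax(dim=1)  # per pixel, distribute weight over object queries
        prior = torch.einsum("bqn,bqc->bnc", weights, query_feats)
        return pixel_feats + self.proj(prior)  # perception-prior-enhanced visual tokens


# Toy usage with random tensors (batch 2, 196 patches, 100 queries, dim 256).
if __name__ == "__main__":
    ppe = PerceptionPriorEmbedding(dim=256)
    tokens = ppe(torch.randn(2, 196, 256), torch.randn(2, 100, 196), torch.randn(2, 100, 256))
    print(tokens.shape)  # torch.Size([2, 196, 256])
```

Under this reading, the segmentation decoder's object queries act as the "perception prior": each visual token handed to the LLM carries both the raw image content at that location and the decoder's object-level hypothesis about it.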
