Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

March 14, 2024
Authors: Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
cs.AI

Abstract

Large Vision Language Models have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential for nuanced visual and language referring in domains such as GUI agents and counting. To address this issue, we introduce Griffon v2, a unified high-resolution generalist model that enables flexible object referring with visual and textual prompts. To scale up image resolution efficiently, we design a simple and lightweight down-sampling projector that overcomes the input token constraint of Large Language Models. This design inherently preserves the complete context and fine details, and significantly improves multimodal perception, especially for small objects. Building on this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer, which enables user-friendly interaction with flexible target images, free-form text, and even coordinates. Experiments demonstrate that Griffon v2 can localize any object of interest with visual and textual referring, achieves state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperforms expert models in object detection and object counting. Data, code, and models will be released at https://github.com/jefferyZhan/Griffon.
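The abstract does not say how the down-sampling projector is built, so the following is only a minimal sketch of why down-sampling the visual token grid eases the input token constraint. Everything in it is an assumption made for illustration: the image size of 896, the patch size of 14, the stride of 2, the helper names `visual_token_count` and `downsample_tokens`, and the use of plain average pooling as a stand-in for whatever operator the authors actually use.

```python
from typing import List

# A token grid: rows x cols x channels (hypothetical layout for illustration).
Grid = List[List[List[float]]]


def visual_token_count(image_size: int, patch_size: int, stride: int = 1) -> int:
    """Count visual tokens for a square image after patching and down-sampling."""
    side = image_size // patch_size   # patches per side produced by the encoder
    side = -(-side // stride)         # ceil division: grid side after down-sampling
    return side * side


def downsample_tokens(grid: Grid, stride: int) -> Grid:
    """Average-pool a token grid over non-overlapping stride x stride windows.

    Average pooling is an assumed stand-in, not the paper's projector.
    """
    rows, cols = len(grid), len(grid[0])
    channels = len(grid[0][0])
    pooled: Grid = []
    for r in range(0, rows, stride):
        out_row = []
        for c in range(0, cols, stride):
            # Collect the token vectors inside this window (clipped at the border).
            window = [grid[i][j]
                      for i in range(r, min(r + stride, rows))
                      for j in range(c, min(c + stride, cols))]
            out_row.append([sum(v[k] for v in window) / len(window)
                            for k in range(channels)])
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    # Illustrative numbers only: a 896-pixel image with 14-pixel patches.
    print(visual_token_count(896, 14))            # tokens without down-sampling
    print(visual_token_count(896, 14, stride=2))  # tokens with stride-2 down-sampling
```

Under these assumed numbers, a stride of 2 cuts the token count by a factor of four while every image region still contributes to some token, which is the general trade-off the abstract describes.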
