Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Abstract

Large Vision Language Models have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents and counting. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To scale up image resolution efficiently, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves complete context and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize any object of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, code, and models will be released at https://github.com/jefferyZhan/Griffon.
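
As a rough illustration of why down-sampling helps with the input token constraint, the sketch below counts the visual tokens a square image produces before and after a strided reduction of the patch grid. The function name, the patch size of 14, the 896-pixel input, and the stride of 2 are all assumptions chosen for illustration; the abstract does not state the actual settings or which operator the projector uses.

```python
def visual_token_count(image_size: int, patch_size: int, stride: int = 1) -> int:
    """Count the visual tokens handed to the language model for a square image.

    All three parameters are hypothetical illustration values, not settings
    reported in the abstract.
    """
    if image_size <= 0 or patch_size <= 0 or stride <= 0:
        raise ValueError("image_size, patch_size, and stride must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")

    # The vision encoder splits the image into a side x side grid of patches.
    side = image_size // patch_size

    # Assumed projector behavior: shrink the grid by `stride` along each
    # spatial dimension, rounding up (ceil division).
    side = -(-side // stride)

    return side * side


# Without down-sampling, a 896 x 896 image with 14 x 14 patches gives 64 x 64 tokens.
print(visual_token_count(896, 14))            # 4096
# With an assumed stride of 2, the grid shrinks to 32 x 32 tokens.
print(visual_token_count(896, 14, stride=2))  # 1024
```

Under these assumptions, halving the grid along each dimension cuts the token count by a factor of four, which is what makes a higher input resolution fit within a fixed token budget.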
