Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Abstract

Large vision-language models (LVLMs) have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential for nuanced visual and language referring in domains such as GUI agents and counting. To address this issue, we introduce Griffon v2, a unified high-resolution generalist model that enables flexible object referring with visual and textual prompts. To scale up image resolution efficiently, we design a simple and lightweight down-sampling projector that overcomes the input-token constraint of large language models. This design inherently preserves the complete context and fine details, and significantly improves multimodal perception, especially for small objects. Building on this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer, which enables user-friendly interaction with flexible target images, free-form text, and even coordinates. Experiments demonstrate that Griffon v2 can localize any object of interest with visual and textual referring, achieves state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperforms expert models in object detection and object counting. Data, code, and models will be released at https://github.com/jefferyZhan/Griffon.
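
To make the input-token constraint concrete, the following is a minimal sketch of how spatially down-sampling a grid of visual tokens shrinks the sequence passed to the language model. It is an illustration only, not the authors' projector: the abstract does not specify the design, so plain average pooling over non-overlapping windows is assumed here as a stand-in, and the function names, the 448-pixel image size, the 14-pixel patch size, and the stride of 2 are all hypothetical.

```python
def downsample_tokens(grid, stride):
    """Average non-overlapping stride x stride windows of a token grid.

    grid: list of rows; each row is a list of feature vectors (lists of floats).
    Returns a coarser grid with ceil(H / stride) x ceil(W / stride) tokens.
    Assumption: average pooling stands in for the paper's projector.
    """
    height, width = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for top in range(0, height, stride):
        out_row = []
        for left in range(0, width, stride):
            # Collect the feature vectors inside the current window,
            # clipping at the grid border.
            window = [
                grid[r][c]
                for r in range(top, min(top + stride, height))
                for c in range(left, min(left + stride, width))
            ]
            # Average each feature dimension over the window.
            pooled = [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
            out_row.append(pooled)
        out.append(out_row)
    return out


def token_count(image_size, patch_size, stride):
    """Return (tokens before, tokens after) down-sampling for a square image."""
    side = image_size // patch_size
    pooled_side = -(-side // stride)  # ceiling division
    return side * side, pooled_side * pooled_side


if __name__ == "__main__":
    # Hypothetical sizes: a 448-pixel image split into 14-pixel patches.
    before, after = token_count(448, 14, 2)
    print(before, after)
```

With a stride of 2, the number of visual tokens drops by roughly a factor of four, which is why a down-sampling step allows a higher input resolution within the same token budget. A learned projector would replace the fixed average with trainable weights.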

