UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

March 3, 2025
Authors: Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang
cs.AI

Abstract

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks in a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving performance comparable or superior to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method integrates seamlessly with existing multimodal large language models (MLLMs), effectively combining fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models will be publicly available.
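To make the embedding-retrieval idea concrete, the sketch below shows one plausible reading of it. This is an illustrative assumption, not the authors' released implementation, and every name in it (retrieve_masks, mask_positions, the <MASK> token) is invented. The premise: the language interface serializes perception targets as text (a detection might be emitted as something like "person <box> 0.12 0.30 0.58 0.91 </box>"), and for segmentation the model additionally emits special mask tokens whose decoder hidden states serve as queries, scored against the visual feature map by a scaled dot product to recover dense masks.

import torch
import torch.nn.functional as F

def retrieve_masks(hidden_states, mask_positions, image_features, out_size):
    """Hypothetical sketch of segmentation by embedding retrieval.

    hidden_states:  (seq_len, d)  last-layer LLM states for one sample
    mask_positions: (k,)          indices of the emitted <MASK> tokens
    image_features: (d, h, w)     visual encoder feature map
    out_size:       (H, W)        desired output mask resolution
    """
    # Retrieve the hidden states at the mask-token positions as queries.
    queries = hidden_states[mask_positions]              # (k, d)
    d, h, w = image_features.shape
    keys = image_features.flatten(1)                     # (d, h*w)
    # Scaled dot-product similarity between each query and every location.
    logits = (queries @ keys) / d ** 0.5                 # (k, h*w)
    masks = logits.view(-1, 1, h, w)                     # (k, 1, h, w)
    # Upsample to the requested resolution and squash to [0, 1].
    masks = F.interpolate(masks, size=out_size,
                          mode="bilinear", align_corners=False)
    return masks.sigmoid().squeeze(1)                    # (k, H, W)

# Toy shapes only; a real model would supply actual LLM and encoder outputs.
hs = torch.randn(32, 256)                    # 32-token sequence, d = 256
feats = torch.randn(256, 16, 16)             # 16x16 visual feature map
masks = retrieve_masks(hs, torch.tensor([5, 17]), feats, (224, 224))
print(masks.shape)                           # torch.Size([2, 224, 224])

Under this reading, no task-specific mask head is required: segmentation reduces to emitting the right tokens through the open-ended language interface and retrieving their embeddings.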
