
From Web to Pixels: Bringing Agentic Search into Visual Perception

May 12, 2026
Authors: Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue
cs.AI

Abstract

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
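The abstract names three stages where failures arise — evidence acquisition, identity resolution, and visual instance binding — which suggests the overall shape of such a workflow. Below is a minimal, hypothetical Python sketch of a search-to-pixel loop built on that reading; `search`, `resolve`, and `ground` are assumed interfaces standing in for a web-search tool, an LLM/VLM resolver, and an open-vocabulary detector, not the paper's actual Pixel-Searcher implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class GroundedResult:
    identity: str                 # resolved name of the hidden target
    box: Box                      # localized instance in the image
    evidence: List[str] = field(default_factory=list)  # supporting snippets/URLs

def pixel_search(
    image,
    query: str,
    search: Callable[[str], List[str]],                 # hypothetical web-search tool
    resolve: Callable[[str, List[str]], Optional[str]],  # hypothetical identity resolver
    ground: Callable[..., Box],                          # hypothetical open-vocab detector
    max_steps: int = 5,
) -> Optional[GroundedResult]:
    """Sketch of an agentic search-to-pixel loop: gather external evidence
    until the hidden target identity is resolved, then bind that identity
    to a pixel-level region (a box here; a mask head would serve the
    segmentation view analogously)."""
    evidence: List[str] = []
    for _ in range(max_steps):
        # 1. Evidence acquisition: pull external facts (recent events,
        #    long-tail entities, multi-hop relations) the image alone lacks.
        evidence.extend(search(query))
        # 2. Identity resolution: decide whether the evidence now pins
        #    down a concrete, visible object; None means keep searching.
        identity = resolve(query, evidence)
        if identity is not None:
            # 3. Visual instance binding: localize the resolved identity.
            box = ground(image, identity)
            return GroundedResult(identity=identity, box=box, evidence=evidence)
    return None  # evidence acquisition or identity resolution failed
```

The same loop would back the Search-based VQA view by returning the resolved identity as a grounded answer instead of a box.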