From Web to Pixels: Bringing Agentic Search into Visual Perception

May 12, 2026
Authors: Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue
cs.AI

Abstract

Visual perception bridges high-level semantic understanding and pixel-level recognition, but most existing settings assume that the decisive evidence for identifying a target is already present in the image or in frozen model knowledge. We study a more practical yet harder open-world case in which a visible object must first be resolved through external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while the remaining failures mainly stem from evidence acquisition, identity resolution, and visual instance binding.
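The abstract describes the search-to-pixel workflow only at a high level. As a rough illustration of what such an agentic loop might look like, here is a minimal Python sketch: a model alternates between issuing web-search queries and committing to a resolved identity, which is then bound to a box, mask, or grounded answer. All names here (`web_search`, `vlm_reason`, `ground_target`, the action schema) are hypothetical assumptions for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass

# Hypothetical sketch of an agentic search-to-pixel loop in the spirit of
# Pixel-Searcher. The helper callables and the action schema are assumptions
# made for illustration; the paper does not specify these interfaces.

@dataclass
class Evidence:
    query: str
    snippets: list[str]

def pixel_search(image, question, web_search, vlm_reason, ground_target,
                 max_steps: int = 5):
    """Gather external evidence until the hidden target identity is resolved,
    then bind that identity to pixels (box/mask) or a grounded VQA answer."""
    evidence: list[Evidence] = []
    for _ in range(max_steps):
        # The model either issues another search query or commits to an identity.
        step = vlm_reason(image, question, evidence)
        if step["action"] == "search":
            snippets = web_search(step["query"])
            evidence.append(Evidence(step["query"], snippets))
        else:  # "answer": identity resolved from the accumulated evidence
            identity = step["identity"]
            # Bind the resolved identity according to the task view
            # (grounding -> box, segmentation -> mask, VQA -> grounded answer).
            return ground_target(image, identity,
                                 task=step.get("task", "grounding"))
    return None  # evidence acquisition failed within the step budget
```

The three failure sources named in the abstract map onto this loop: evidence acquisition (the search branch), identity resolution (the commit decision), and visual instance binding (the final grounding call).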