Visual-Seeker: Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by using external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and on text-only evidence trajectories, which limits the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, and validate robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.