Struct-Searcher: Agentic Structural Reasoning Improves Multimodal Deep Information Seeking

Abstract

Deep research agents have attracted increasing attention for their ability to gather large amounts of online information in pursuit of target knowledge, and recent efforts have shifted from purely text-based information seeking to multimodal settings. However, existing agentic workflows largely follow evidence accumulation models, which aggregate evidence linearly and lack principled mechanisms for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones; and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.
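The reported gains are relative rather than absolute accuracy improvements, so a short sketch of the arithmetic may help when reading the numbers. The snippet below assumes the usual definition, (new - baseline) / baseline, and assumes that the average over backbones is a plain mean of per-backbone relative gains; the abstract does not state the averaging scheme. The function names and the accuracy values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Relative accuracy improvement of new_acc over baseline_acc, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) * 100.0 / baseline_acc


def mean_relative_improvement(pairs) -> float:
    """Plain mean of per-backbone relative improvements.

    Assumption: the abstract does not say how the average is formed;
    a simple arithmetic mean is used here for illustration only.
    """
    gains = [relative_improvement(base, new) for base, new in pairs]
    return sum(gains) / len(gains)


# Hypothetical (baseline, with-workflow) accuracies in percent.
# These are made-up illustration values, NOT numbers from the paper.
hypothetical_pairs = [(40.0, 46.0), (50.0, 55.0)]

for base, new in hypothetical_pairs:
    print(f"{base:.1f} -> {new:.1f}: "
          f"absolute +{new - base:.1f} points, "
          f"relative +{relative_improvement(base, new):.1f}%")

print(f"mean relative improvement: "
      f"{mean_relative_improvement(hypothetical_pairs):.1f}%")
```

Under this definition, a relative gain depends on the baseline: the same absolute gain in accuracy points corresponds to a larger relative improvement when the baseline accuracy is low.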