Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
May 28, 2026
Authors: Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Zhicheng Dou
cs.AI
Abstract
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.
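The abstract describes a verifier that acts as an acceptance function over the work of other agents. The following is only a minimal illustrative sketch of that general idea, not the authors' implementation: the names `Claim`, `verify`, `run_with_verifier`, and `stub_research`, the retry loop, and the two toy checks (every claim needs a citation, every referenced image must exist in a visual working memory) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Claim:
    """One report statement (hypothetical structure, not from the paper)."""
    text: str
    source_url: str = ""   # citation backing the claim; empty means uncited
    image_id: str = ""     # optional key into the visual working memory


def verify(claims: List[Claim], visual_memory: Dict[str, str]) -> List[str]:
    """Toy acceptance function: return rejection reasons (empty list = accept)."""
    problems = []
    for claim in claims:
        if not claim.source_url:
            problems.append(f"uncited claim: {claim.text}")
        if claim.image_id and claim.image_id not in visual_memory:
            problems.append(f"unknown image: {claim.image_id}")
    return problems


def run_with_verifier(
    query: str,
    research: Callable[[str, List[str]], List[Claim]],
    visual_memory: Dict[str, str],
    max_rounds: int = 3,
) -> List[Claim]:
    """Re-run the research step until the verifier accepts or rounds run out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        claims = research(query, feedback)
        feedback = verify(claims, visual_memory)
        if not feedback:
            return claims
    raise RuntimeError(f"verifier rejected all {max_rounds} drafts: {feedback}")


def stub_research(query: str, feedback: List[str]) -> List[Claim]:
    """Stand-in for a research agent: fixes its draft once it gets feedback."""
    if not feedback:
        # First draft: no citation and an image that is not in memory.
        return [Claim("Example claim", source_url="", image_id="fig-9")]
    return [Claim("Example claim",
                  source_url="https://example.org/source",
                  image_id="fig-1")]
```

In this sketch, calling `run_with_verifier("some query", stub_research, {"fig-1": "https://example.org/source"})` rejects the first draft and accepts the second. Passing the verifier's feedback back into the research step is one possible design; the abstract does not specify how rejections are handled.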