Toward Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Abstract

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.
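The idea of a verifier acting as an acceptance function can be illustrated with a minimal sketch. This is not Ptah's implementation: the names `Draft`, `make_verifier`, `run_stage`, and `toy_writer` are hypothetical, and the checks (every claim cites a known source, every image exists in a visual working memory) are simplified stand-ins for factual grounding, citation fidelity, and cross-modal consistency.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Draft:
    """Hypothetical output of one stage (plan, research notes, or report)."""
    stage: str
    claims: List[str]
    citations: Dict[str, str]                         # claim -> source id
    images: List[str] = field(default_factory=list)   # image ids used in the draft


def make_verifier(sources: Set[str], visual_memory: Set[str]) -> Callable[[Draft], bool]:
    """Build an acceptance function over known sources and stored images (assumed design)."""
    def accept(draft: Draft) -> bool:
        # Every claim must cite a source that was actually collected.
        grounded = all(draft.citations.get(c) in sources for c in draft.claims)
        # Every image must be present in the visual working memory.
        images_ok = all(img in visual_memory for img in draft.images)
        return grounded and images_ok
    return accept


def run_stage(produce: Callable[[int], Draft],
              accept: Callable[[Draft], bool],
              max_attempts: int = 3) -> Draft:
    """Retry a stage until the verifier accepts its output or attempts run out."""
    for attempt in range(max_attempts):
        draft = produce(attempt)
        if accept(draft):
            return draft
    raise RuntimeError(f"no draft accepted after {max_attempts} attempts")


def toy_writer(attempt: int) -> Draft:
    """Toy writing agent: the first attempt cites an unknown source, the retry fixes it."""
    source = "src-unknown" if attempt == 0 else "src-1"
    return Draft("writing", ["claim A"], {"claim A": source}, ["img-1"])


accept = make_verifier(sources={"src-1"}, visual_memory={"img-1"})
report = run_stage(toy_writer, accept)
print(report.citations)
```

In this sketch the same `accept` function could gate the planning, research, and writing stages in turn, which is one way to read "throughout the workflow"; how Ptah actually schedules verification is not specified in the abstract.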