TVIR: Building Deep Research Agents for Text-Visual Interleaved Report Generation

Abstract

Deep Research Agents have shown strong capabilities in multi-step information retrieval, reasoning, and long-form report generation. However, existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which comprises two components. TVIR-Bench is a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals. TVIR-Agent is a hierarchical multi-agent framework that serves as a strong baseline: it constructs outlines, retrieves images, generates charts with traceable sources, and composes reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.