
Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

June 3, 2025
作者: Zhaorui Yang, Bo Pan, Han Wang, Yiyao Wang, Xingyu Liu, Minfeng Zhu, Bo Zhang, Wei Chen
cs.AI

Abstract

Visualizations play a crucial part in the effective communication of concepts and information. Recent advances in reasoning and retrieval-augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite this progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved text and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics serving as inputs, along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.
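The four-stage decomposition described above can be sketched as a simple pipeline. This is a minimal illustrative sketch only: all function names, the `FDV` field layout, and the stubbed stage logic are assumptions for illustration, not the paper's actual implementation or API.

```python
# Hypothetical sketch of the four-stage Multimodal DeepResearcher pipeline.
# Names and data shapes are illustrative assumptions, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class FDV:
    """Formal Description of Visualization: a structured textual chart spec."""
    chart_type: str
    title: str
    data_description: str
    design_notes: str = ""

@dataclass
class MultimodalReport:
    """Interleaved report: text sections plus FDV chart specs."""
    text_sections: list = field(default_factory=list)
    charts: list = field(default_factory=list)

def research(topic):
    # Stage 1: gather source material (stubbed; a real system would call
    # a search/retrieval backend and an LLM here).
    return [f"notes on {topic}"]

def textualize_exemplars(exemplars):
    # Stage 2: convert exemplar multimodal reports into text, with their
    # charts rewritten as FDV strings so an LLM can learn from them.
    return [f"textualized: {e}" for e in exemplars]

def plan(notes):
    # Stage 3: outline the report, deciding where charts should appear.
    return [{"section": "Overview", "needs_chart": True}]

def generate(topic, outline, notes):
    # Stage 4: produce interleaved text sections and FDV chart specs.
    report = MultimodalReport()
    for item in outline:
        report.text_sections.append(f"{item['section']}: summary of {topic}")
        if item["needs_chart"]:
            report.charts.append(
                FDV("bar", f"{item['section']} chart", "key figures from notes")
            )
    return report

def multimodal_deepresearcher(topic):
    notes = research(topic)
    _exemplars = textualize_exemplars(["exemplar report"])
    outline = plan(notes)
    return generate(topic, outline, notes)

report = multimodal_deepresearcher("renewable energy trends")
print(len(report.text_sections), len(report.charts))
```

The key design point the sketch mirrors is that charts are represented as structured text (FDV) throughout, so a text-only LLM can plan and emit them alongside prose; rendering FDV specs into actual images would be a separate downstream step.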

