TVIR: Building Deep Research Agents for Text-Visual Interleaved Report Generation

Abstract

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which has two components. The first, TVIR-Bench, is a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals. The second, TVIR-Agent, is a hierarchical multi-agent framework that serves as a strong baseline: it constructs outlines, retrieves images, generates charts with traceable sources, and composes reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.