TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation
June 1, 2026
Authors: Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu
cs.AI
Abstract
Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.