TVIR: Building Deep Research Agents for Text-Visual Interleaved Report Generation

Abstract

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.