Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Abstract

Visualizations play a crucial role in communicating concepts and information effectively. Recent advances in reasoning and retrieval-augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite this progress, existing deep research frameworks focus primarily on generating text-only content, leaving the automated generation of interleaved text and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and integrating them effectively with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. To evaluate the generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics that serve as inputs, along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, when both use the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.
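
As a reading aid for the win-rate figure, the sketch below shows one common way a pairwise win rate is computed from per-topic judgments. It is a minimal illustration under stated assumptions, not the paper's evaluation code: the function name `win_rate`, the `"win"`/`"loss"`/`"tie"` labels, and the choice to count ties in the denominator are all hypothetical, and the example judgments are made-up numbers chosen only to show the arithmetic.

```python
from collections import Counter


def win_rate(outcomes):
    """Return the fraction of pairwise comparisons won by the evaluated system.

    `outcomes` holds one label per topic: "win", "loss", or "tie".
    Assumption: ties count toward the denominator; the source does not
    state how ties are treated.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to score")
    return counts["win"] / total


# Hypothetical judgments over 100 topics (illustrative, not the paper's data).
example = ["win"] * 82 + ["loss"] * 18
print(f"{win_rate(example):.0%}")  # prints "82%"
```

Under a different convention, such as excluding ties or counting each tie as half a win, the same judgments could yield a different rate, so the convention should be checked against the paper before comparing numbers.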
