Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Abstract

Visualizations play a crucial role in communicating concepts and information effectively. Recent advances in reasoning and retrieval-augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite this progress, existing deep research frameworks focus primarily on generating text-only content, leaving the automated generation of interleaved text and visualizations underexplored. This novel task poses two key challenges: designing informative visualizations and integrating them effectively with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. To evaluate the generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics serving as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, when both use the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.
