TVIR: Building Deep Research Agents for Text–Visual Interleaved Report Generation

Abstract

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation. However, existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text–Visual Interleaved Report Generation), which comprises TVIR-Bench and TVIR-Agent. TVIR-Bench is a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals. TVIR-Agent is a hierarchical multi-agent framework that serves as a strong baseline: it constructs outlines, retrieves images, generates charts with traceable sources, and composes reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
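
As a rough illustration of what "context-aware sequential writing" over an outline could look like, the sketch below fills in sections one at a time and passes everything written so far to the next writing step. It is not the TVIR-Agent implementation: the `Section` type, the `needs_visual` flag, `write_report`, and the stub callables are all hypothetical names introduced here, and the stubs stand in for whatever model and tool calls a real system would make.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Section:
    title: str
    needs_visual: bool = False  # hypothetical flag: this sub-goal calls for a figure
    text: str = ""
    visuals: List[str] = field(default_factory=list)


def write_report(
    outline: List[Section],
    write_section: Callable[[Section, str], str],
    make_visual: Callable[[Section], str],
) -> str:
    """Fill in sections in order, passing all earlier sections as context."""
    context = ""
    for section in outline:
        if section.needs_visual:
            # Attach the visual first so the writer can refer to it.
            section.visuals.append(make_visual(section))
        section.text = write_section(section, context)
        block = "\n".join([section.title, section.text, *section.visuals])
        context = f"{context}\n\n{block}" if context else block
    return context


# Stub callables (assumptions) standing in for model and tool calls.
def stub_writer(section: Section, context: str) -> str:
    return f"[{section.title}: written with {len(context)} chars of prior context]"


def stub_visual(section: Section) -> str:
    return f"[chart for {section.title}; source: <placeholder>]"


report = write_report(
    [Section("Background"), Section("Market trends", needs_visual=True)],
    stub_writer,
    stub_visual,
)
print(report)
```

The point of the design is only that each section is written with the preceding text and visuals in view, so later analysis can stay consistent with earlier figures. How outlines are built, how images are retrieved, and how chart sources are tracked are left out of this sketch.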