DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation
February 28, 2026
Authors: Tingwen Zhang, Ling Yue, Zhen Xu, Shaowu Pan
cs.AI
Abstract
Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and executable code. However, producing a publication-grade scientific diagram (e.g., a teaser figure) remains a major bottleneck in the "end-to-end" paper generation process. A teaser figure, for example, acts as a strategic visual interface and serves a different purpose than derivative data plots: it demands conceptual synthesis and planning to translate a complex logical workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back on an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset of 89,422 schematic diagrams curated from top-tier scientific publications and designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is built with an automated curation pipeline that extracts figures and their corresponding in-text references and uses a CLIP-based filter to distinguish schematic diagrams from standard plots and natural images. Each instance is paired with rich context, ranging from the abstract and caption to figure-reference pairs, enabling information retrieval at different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank, with code at https://github.com/csml-rpi/DiagramBank.
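To make the idea of retrieval at different query granularities concrete, the following is a minimal sketch, not the authors' implementation. It assumes each record carries three text fields (`abstract`, `caption`, `reference`); these field names, the toy records, and the `retrieve` helper are all hypothetical and do not reflect the released DiagramBank schema. A simple bag-of-words cosine similarity stands in for whatever embedding model a real retrieval system would use.

```python
import math
import re
from collections import Counter

# Hypothetical toy records; field names and contents are illustrative only
# and are NOT the actual DiagramBank schema.
RECORDS = [
    {"id": "fig-a",
     "abstract": "We propose a graph neural network for molecular property prediction.",
     "caption": "Overview of the message passing architecture.",
     "reference": "Figure 1 shows the encoder and readout stages."},
    {"id": "fig-b",
     "abstract": "A retrieval-augmented generation framework for question answering.",
     "caption": "Pipeline of the retriever and generator modules.",
     "reference": "As shown in Figure 2, retrieved passages are fed to the generator."},
    {"id": "fig-c",
     "abstract": "We study diffusion models for image synthesis.",
     "caption": "Training and sampling workflow of the diffusion model.",
     "reference": "Figure 3 illustrates the denoising steps."},
]


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, field, records=RECORDS, k=2):
    """Rank records by similarity between the query and one context field.

    `field` selects the granularity: "abstract" (paper level),
    "caption" (figure level), or "reference" (in-text mention level).
    Records with zero similarity are dropped.
    """
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(r[field])), r["id"]) for r in records]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by id
    return [rid for score, rid in scored[:k] if score > 0]


if __name__ == "__main__":
    print(retrieve("retrieval-augmented generation", "abstract", k=1))
    print(retrieve("diffusion sampling workflow", "caption", k=1))
    print(retrieve("encoder and readout stages", "reference", k=1))
```

The same query function serves all three granularities; only the indexed field changes. In a retrieval-augmented setup, the returned exemplar identifiers would then be used to fetch the corresponding diagrams as conditioning examples for generation.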