DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation
February 28, 2026
Authors: Tingwen Zhang, Ling Yue, Zhen Xu, Shaowu Pan
cs.AI
Abstract
Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and executable code. However, producing a publication-grade scientific diagram (e.g., a teaser figure) remains a major bottleneck in the "end-to-end" paper generation process. A teaser figure, for example, acts as a strategic visual interface and serves a different purpose than derivative data plots: it demands conceptual synthesis and planning to translate a complex logical workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset of 89,422 schematic diagrams curated from top-tier scientific publications and designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is built with an automated curation pipeline that extracts figures and their corresponding in-text references, and uses a CLIP-based filter to distinguish schematic diagrams from standard plots and natural images. Each instance is paired with rich context, ranging from the abstract and caption to figure-reference pairs, enabling information retrieval at different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase that demonstrates exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank, with code at https://github.com/csml-rpi/DiagramBank.
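To make the idea of retrieval at different query granularities concrete, the following is a minimal sketch and not the DiagramBank codebase. It assumes each record carries three text fields (`abstract`, `caption`, `references`); these field names, the toy records, and the `retrieve` helper are hypothetical. A simple bag-of-words cosine similarity stands in for whatever embedding-based retriever the released code actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, records, field, k=1):
    """Rank records by similarity between the query and one context field.

    `field` selects the granularity: "abstract" (paper level),
    "caption" (figure level), or "references" (in-text mentions).
    Field names are hypothetical, not the dataset's actual schema.
    """
    query_vec = Counter(tokenize(query))
    scored = []
    for rec in records:
        value = rec[field]
        text = " ".join(value) if isinstance(value, list) else value
        scored.append((cosine(query_vec, Counter(tokenize(text))), rec["id"]))
    # Highest score first; ties broken by id for deterministic output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [rec_id for _, rec_id in scored[:k]]


# Toy records invented for illustration only.
RECORDS = [
    {
        "id": "fig-a",
        "abstract": "We study retrieval-augmented generation for question answering.",
        "caption": "Overview of the retriever and generator pipeline.",
        "references": ["Figure 1 shows how retrieved passages are fed to the generator."],
    },
    {
        "id": "fig-b",
        "abstract": "We propose a graph neural network for molecule property prediction.",
        "caption": "Architecture of the message passing network.",
        "references": ["As shown in Figure 2, node features are aggregated over neighbors."],
    },
]

if __name__ == "__main__":
    print(retrieve("graph neural network for molecules", RECORDS, "abstract"))
    print(retrieve("retriever and generator pipeline", RECORDS, "caption"))
    print(retrieve("node features aggregated over neighbors", RECORDS, "references"))
```

The same query function serves all three granularities; only the indexed field changes. In a real setup, one would presumably replace the token-count vectors with text or image embeddings and an approximate nearest-neighbor index.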