SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model

Abstract

Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models such as GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and the transformation of abstract ideas into clear, standardized visuals. The task is significantly more knowledge-intensive and laborious than general image synthesis, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet no benchmark currently exists to evaluate AI on this task. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected by human experts and multimodal large language models (MLLMs). Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results reveal that even top-tier models such as GPT-4o-image lag behind human performance, with common issues in text/visual clarity and scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.

