SridBench: A Benchmark for Image Generation Models on Scientific Research Illustration Drawing

Abstract

Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, whereas newer multimodal models such as GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and the transformation of abstract ideas into clear, standardized visuals. The task is considerably more knowledge-intensive and laborious than general image synthesis, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet no benchmark currently exists to evaluate AI on this front. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected by human experts and multimodal large language models (MLLMs). Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results show that even top-tier models such as GPT-4o-image lag behind human performance, with common issues in text and visual clarity and in scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.