SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Models

Abstract

Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models such as GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and the transformation of abstract ideas into clear, standardized visuals. The task is significantly more knowledge-intensive and laborious, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet no benchmark currently exists to evaluate AI on this task. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected by human experts and multimodal large language models (MLLMs). Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results reveal that even top-tier models such as GPT-4o-image lag behind human performance, with common issues in text/visual clarity and scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.